Add third data quality metric #11939

marthasharkey · 2024-12-24T11:11:29Z

Pull Request Description

closes Add Data Quality indicator to columns in the Table Viz #6332

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

The documentation has been updated, if necessary.
Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
All code follows the Scala, Java, TypeScript, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
Unit tests have been written where possible.
If meaningful changes were made to logic or tests affecting Enso Cloud integration in the libraries, or the Snowflake database integration, a run of the Extra Tests has been scheduled.
- If applicable, it is suggested to paste a link to a successful run of the Extra Tests.

jdunkerley

Some tweaks for efficiency but like the approach.

jdunkerley · 2025-01-08T14:57:13Z

std-bits/table/src/main/java/org/enso/table/data/column/operation/CountWhitespace.java

+import org.enso.table.data.table.Column;
+import org.graalvm.polyglot.Context;
+
+public class CountWhitespace {


Let's create a base class (something like SampledOperation), which would hold the RANDOM_SEED and DEFAULT_SAMPLE_SIZE.

CountWhitespace feels like the wrong name; it should probably be CountNonTrivialWhitespace.
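A minimal sketch of what such a base class could look like, assuming the two constants are plain shared static fields. The constant values and the `isSampled` helper are placeholders for illustration; the thread does not state the real values.

```java
/** Hypothetical base class; the constant values are placeholders, not the project's. */
abstract class SampledOperation {
  /** Fixed seed so repeated runs inspect the same rows (placeholder value). */
  static final long RANDOM_SEED = 677L;

  /** Columns with more rows than this are only sampled (placeholder value). */
  static final long DEFAULT_SAMPLE_SIZE = 10_000L;

  /** True when the column is large enough that only a sample is inspected. */
  static boolean isSampled(long rowCount, long sampleSize) {
    return rowCount > sampleSize;
  }
}

/** Each metric inherits the shared constants instead of redeclaring them. */
final class CountNonTrivialWhitespace extends SampledOperation {}

final class CountUntrimmed extends SampledOperation {}
```

With this shape, `CountUntrimmed.DEFAULT_SAMPLE_SIZE` and `CountNonTrivialWhitespace.DEFAULT_SAMPLE_SIZE` resolve to the same field, so the two metrics cannot drift apart.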

jdunkerley · 2025-01-08T14:59:57Z

std-bits/table/src/main/java/org/enso/table/data/column/storage/StringStorage.java

  private Future<Long> untrimmedCount;
+  private Future<Long> whitespaceCount;


Let's create a record type here.

record DataQualityMetrics(Long untrimmedCount, Long whitespaceCount);

We can then use a single CompletableFuture to compute both values.
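A rough sketch of the record plus a single future, under several assumptions: `MetricsHolder` stands in for StringStorage, the `of` factory and the per-string checks are simplified placeholders for the real Text_Utils logic, and `supplyAsync` is just one way to start the computation.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Both metrics travel together, so one future can replace two. */
record DataQualityMetrics(Long untrimmedCount, Long whitespaceCount) {

  /** Hypothetical factory: computes both counts in a single pass over the values. */
  static DataQualityMetrics of(List<String> values) {
    long untrimmed = 0;
    long nonTrivial = 0;
    for (String s : values) {
      if (s == null) continue;
      // Stand-in checks only; the real ones live in Text_Utils.
      if (!s.equals(s.strip())) untrimmed++;
      if (s.chars().anyMatch(c -> c != ' ' && Character.isWhitespace(c))) nonTrivial++;
    }
    return new DataQualityMetrics(untrimmed, nonTrivial);
  }
}

/** Hypothetical stand-in for StringStorage. */
final class MetricsHolder {
  private final List<String> values;
  private CompletableFuture<DataQualityMetrics> metrics;

  MetricsHolder(List<String> values) {
    this.values = values;
  }

  /** Lazily starts one computation that serves both metrics. */
  synchronized CompletableFuture<DataQualityMetrics> metrics() {
    if (metrics == null) {
      metrics = CompletableFuture.supplyAsync(() -> DataQualityMetrics.of(values));
    }
    return metrics;
  }
}
```

Callers would then read `holder.metrics().join().untrimmedCount()` or `.whitespaceCount()` from the same future rather than tracking two fields.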

jdunkerley · 2025-01-08T15:01:51Z

std-bits/base/src/main/java/org/enso/base/Text_Utils.java

+    List<String> trivialWhiteSpaceList =
+        List.of(
+            "\u200A", "\u200B", "\u205F", "\u2004", "\u2005", "\u2006", "\u2008", "\u2009",
+            "\u2007", "\r", "\n", "\t", "\u2002", "\u00A0", "\u3000", "\u2003");


Set<Character> should work here I think.

The variable name makes me think it is good not bad.

We can then loop over the characters of the String and return true if the set contains any of them.

I was also confused by the variable name, shouldn't it be nonTrivialWhiteSpace instead?

Also, I imagine the JVM is good at optimizing such things, but perhaps we will make it easier for it if we make the list (or Set) a private static final field ensuring that it is computed only once upon initialization and not on every invocation of this function (since this function may be invoked millions of times).

I imagine the JIT would fold this constant at some point, but making it a constant from the start would probably just be better.
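For illustration, the Set-based variant with a `private static final` field might look roughly like this. The class and method names are hypothetical; the characters are the ones from the list in the diff.

```java
import java.util.Set;

final class WhitespaceSetCheck {
  /** Built once at class initialization; contents mirror the list in the diff. */
  private static final Set<Character> NON_TRIVIAL_WHITESPACE =
      Set.of(
          '\u200A', '\u200B', '\u205F', '\u2004', '\u2005', '\u2006', '\u2008', '\u2009',
          '\u2007', '\r', '\n', '\t', '\u2002', '\u00A0', '\u3000', '\u2003');

  /** Returns true as soon as any character of the string is in the set. */
  static boolean hasNonTrivialWhitespace(String s) {
    for (int i = 0; i < s.length(); i++) {
      if (NON_TRIVIAL_WHITESPACE.contains(s.charAt(i))) {
        return true;
      }
    }
    return false;
  }
}
```

This does one hash lookup per character instead of one text search per list entry, and the set is never rebuilt between calls.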

Other thing - how do we know that this list is comprehensive?

Checking e.g. https://jkorpela.fi/chars/spaces.html suggests that we are missing e.g. the NARROW NO-BREAK SPACE U+202F or U+180E - the "MONGOLIAN VOWEL SEPARATOR" 😅

I think that as long as we can reasonably avoid it, I'd rather be against including such lists of constants in our code - it is easy for them to miss some obscure constants, and they are a bit less future proof. If instead we rely on ICU4J or the Java SDK itself, if there are new Unicode versions released in the future we'd just need to rely on them to get updated to the latest version instead of having to update many of such constant lists manually (which we will probably never do unless users start reporting bugs).

Of course things like "MONGOLIAN VOWEL SEPARATOR" are probably not that commonly used in most cases and missing them probably won't be a huge deal for a generic metric. But if we can use existing dependencies that we rely on anyway, I'd suggest preferring that.

Thus we could rewrite the code to essentially:

iterate over every character in the text,

check if it is any whitespace using UCharacter::isUWhiteSpace,

if it is whitespace - then check if it is 'non-trivial' - a non trivial whitespace is any whitespace that is not the " " character.

for (...) {
  char c = str.charAt(i);
  if (UCharacter.isUWhiteSpace(c)) {
    boolean isNonTrivial = c != ' ';
    if (isNonTrivial) {
      return true;
    }
  }
  context.safepoint();
}

Ah sorry small correction - after further reading perhaps the two examples I provided above may actually not be classified as whitespace by the Unicode standard... So perhaps the List was good.

But I think we can still simplify the code by relying on isUWhiteSpace. It will also likely be much faster as it relies on bit-fiddling and boils down to 1 bit check per character instead of 16 independent text searches.
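A standalone sketch of that loop, with the whitespace classifier passed in as a parameter. In the codebase the argument would presumably be ICU4J's `UCharacter::isUWhiteSpace`; the JDK-only predicate below is an approximation used so the example runs without ICU4J, and the `context.safepoint()` call is left out because there is no Graal context here.

```java
import java.util.function.IntPredicate;

final class NonTrivialWhitespace {
  /** JDK-only stand-in for UCharacter::isUWhiteSpace; an approximation, not equivalent. */
  static final IntPredicate JDK_WHITESPACE =
      cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp);

  /** True if the text contains any whitespace other than the plain ' ' character. */
  static boolean hasNonTrivialWhitespace(String text, IntPredicate isWhitespace) {
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (isWhitespace.test(c) && c != ' ') {
        return true;
      }
    }
    return false;
  }
}
```

Keeping the classifier as a parameter means the "what counts as whitespace" decision stays with the library, and only the "is it the regular space" rule lives in our code.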

Yeah, I agree that not relying on a list seems more reliable and more efficient! I didn't know we already use isUWhiteSpace elsewhere, so thanks for pointing it out!

jdunkerley · 2025-01-08T15:03:04Z

distribution/lib/Standard/Visualization/0.0.0-dev/src/Table/Visualization.enso

        number_untrimmed = case all_rows_count > Column.default_sample_size of
            False -> JS_Object.from_pairs [["name", "Count untrimmed whitespace"], ["percentage_value", columns.map .count_untrimmed]]
            True -> JS_Object.from_pairs [["name", "Count untrimmed whitespace (sampled)"], ["percentage_value", columns.map .count_untrimmed]]
-        [number_nothing, number_untrimmed]
+        number_non_triv = case all_rows_count > Column.default_sample_size of
+            False -> JS_Object.from_pairs [["name", "Count non trivial whitespace"], ["percentage_value", columns.map .count_non_trivial_whitespace]]
+            True -> JS_Object.from_pairs [["name", "Count non trivial whitespace (sampled)"], ["percentage_value", columns.map .count_non_trivial_whitespace]]
+        JS_Object.from_pairs 
+        [number_nothing, number_untrimmed, number_non_triv]


Let's change it so we check the count only once. Tiny nit.

name_extra = if all_rows_count > Column.default_sample_size then " (sampled)" else ""

Yeah it would be much better if we go the name_extra route and avoid the duplication in all the percentage_value entries.

GregoryTravis · 2025-01-08T15:55:28Z

distribution/lib/Standard/Table/0.0.0-dev/src/Column.enso

+      Used for data quality indicator in Table Viz.
+    count_non_trivial_whitespace : Integer -> Integer | Nothing
+    count_non_trivial_whitespace self sample_size:Integer=Column.default_sample_size =
+        if (self.value_type == Value_Type.Mixed || self.value_type.is_text).not then Nothing else


Suggested change

if (self.value_type == Value_Type.Mixed || self.value_type.is_text).not then Nothing else

if self.value_type.is_text.not then Nothing else

I think Mixed implies is_text.not

No but this suggestion is not equivalent:

(self.value_type == Value_Type.Mixed || self.value_type.is_text).not === self.value_type != Value_Type.Mixed && self.value_type.is_text.not !== self.value_type.is_text.not

The counter-example is a mixed type column - self.value_type == Value_Type.Mixed.

Then the first and the equivalent second expression will evaluate to False but your suggested (third) expression evaluates to True.

I guess it becomes clearer if the condition is named:

Suggested change

if (self.value_type == Value_Type.Mixed || self.value_type.is_text).not then Nothing else

is_eligible_for_whitespace_count = self.value_type == Value_Type.Mixed || self.value_type.is_text

if is_eligible_for_whitespace_count.not then Nothing else
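To make the counter-example concrete, here is a small Java model of the two guards. `ValueType` is a simplified stand-in for Enso's Value_Type with only the cases needed for the argument.

```java
final class EligibilityCheck {
  /** Simplified stand-in for Value_Type; only the cases needed here. */
  enum ValueType {
    TEXT,
    MIXED,
    INTEGER;

    boolean isText() {
      return this == TEXT;
    }
  }

  /** The named condition: text and mixed columns can both hold strings. */
  static boolean isEligible(ValueType t) {
    return t == ValueType.MIXED || t.isText();
  }

  /** Original guard: return Nothing only when the column is not eligible. */
  static boolean skipsOriginal(ValueType t) {
    return !isEligible(t);
  }

  /** Proposed simplification, which would also skip mixed columns. */
  static boolean skipsSimplified(ValueType t) {
    return !t.isText();
  }
}
```

The two guards agree for TEXT and INTEGER and differ only for MIXED, which is exactly the case the original condition was written to keep.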

I have pulled this out into a function as the same check is done in the other data quality count

Sounds great

GregoryTravis · 2025-01-08T15:58:50Z

distribution/lib/Standard/Visualization/0.0.0-dev/src/Table/Visualization.enso

        number_untrimmed = case all_rows_count > Column.default_sample_size of
            False -> JS_Object.from_pairs [["name", "Count untrimmed whitespace"], ["percentage_value", columns.map .count_untrimmed]]
            True -> JS_Object.from_pairs [["name", "Count untrimmed whitespace (sampled)"], ["percentage_value", columns.map .count_untrimmed]]
-        [number_nothing, number_untrimmed]
+        number_non_triv = case all_rows_count > Column.default_sample_size of
+            False -> JS_Object.from_pairs [["name", "Count non trivial whitespace"], ["percentage_value", columns.map .count_non_trivial_whitespace]]
+            True -> JS_Object.from_pairs [["name", "Count non trivial whitespace (sampled)"], ["percentage_value", columns.map .count_non_trivial_whitespace]]
+        JS_Object.from_pairs 
+        [number_nothing, number_untrimmed, number_non_triv]


name_extra = if all_rows_count > Column.default_sample_size then " (sampled)" else ""

marthasharkey · 2025-01-09T14:44:49Z

std-bits/base/src/main/java/org/enso/base/Text_Utils.java

   * @return whether the string has leading or trailing whitespace
   */
-  public static boolean has_leading_trailing_whitespace(String s) {


This was initially written in Enso and then the logic was copied over. Now that we are in a Java file, I have rewritten it in a way I feel is simpler. Happy to revert if the original is preferred.

radeusgd · 2025-01-09T15:38:03Z

std-bits/base/src/main/java/org/enso/base/Text_Utils.java

+    String trimmedString = initialString.trim();
+    return trimmedString.length() != initialString.length();


The earlier algorithm was working in constant memory. Now it may allocate O(N) (as big as the input string) additional memory (just to throw it away).

I know we shouldn't optimize prematurely but this amount of copying seems slightly concerning if it may be done on huge values.
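As a sketch of the constant-memory alternative: only the first and last code points need to be inspected. Using `Character.isWhitespace` as the classifier is an assumption made for this example; the original method may classify whitespace differently.

```java
final class LeadingTrailingWhitespace {
  /** Inspects only the first and last code points, so no copy of the string is made. */
  static boolean hasLeadingOrTrailingWhitespace(String s) {
    if (s == null || s.isEmpty()) {
      return false;
    }
    int first = s.codePointAt(0);
    int last = s.codePointBefore(s.length());
    // Assumption: Character.isWhitespace is used here purely for illustration.
    return Character.isWhitespace(first) || Character.isWhitespace(last);
  }

  /** The trim-based version from the diff, kept for comparison. */
  static boolean viaTrim(String s) {
    return s.trim().length() != s.length();
  }
}
```

Besides the extra allocation, `String.trim()` only strips characters up to U+0020, so the two versions disagree on inputs such as a leading EM SPACE (U+2003).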

distribution/lib/Standard/Table/0.0.0-dev/src/Column.enso

GregoryTravis · 2025-01-15T17:13:39Z

std-bits/table/src/main/java/org/enso/table/data/column/storage/StringStorage.java

-          CompletableFuture.completedFuture(
-              CountUntrimmed.compute(
-                  this, CountUntrimmed.DEFAULT_SAMPLE_SIZE, Context.getCurrent()));
+      dataQualityMetricsValues =


The code that constructs dataQualityMetricsValues is repeated in this file -- it would be good to separate it out into a helper.

GregoryTravis

Looks good, just one suggestion about factoring out some common code.

Co-authored-by: Gregory Michael Travis <[email protected]>

jdunkerley

Agree with @GregoryTravis's comment: it would be worth extracting a method to build the DataQualityMetrics. It could be a static method in the record type taking the StringStorage.

jdunkerley · 2025-01-16T13:36:32Z

distribution/lib/Standard/Table/0.0.0-dev/src/Column.enso

+    can_contain_text : Value_Type -> Boolean
+    is_eligible_for_text_data_metric_count value_type = 


Suggested change

can_contain_text : Value_Type -> Boolean

is_eligible_for_text_data_metric_count value_type =

private is_eligible_for_text_data_metric_count value_type:Value_Type -> Boolean =

jdunkerley · 2025-01-16T13:40:18Z

...bits/table/src/main/java/org/enso/table/data/column/operation/CountNonTrivialWhitespace.java

+  }
+
+  /** Internal method performing the calculation on a storage. */
+  public static long compute(ColumnStorage storage, long sampleSize, Context context) {


Let's get this merged, but we can move the loop code here into SampleOperation and call it with a delegate to perform the operation.
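One possible shape for that follow-up, sketched under heavy assumptions: a `List<String>` replaces ColumnStorage, the Context parameter is dropped, the seed is a placeholder, and sampling with replacement is a simplification since the thread does not show the real sampling strategy.

```java
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

final class SampledCount {
  static final long RANDOM_SEED = 677L; // placeholder value

  /** Counts matching values, scanning every row or a seeded random sample of them. */
  static long count(List<String> values, long sampleSize, Predicate<String> matches) {
    int size = values.size();
    long count = 0;
    if (size <= sampleSize) {
      // Small column: check every value.
      for (String v : values) {
        if (v != null && matches.test(v)) count++;
      }
    } else {
      // Large column: check sampleSize randomly chosen rows (with replacement).
      Random rng = new Random(RANDOM_SEED);
      for (long i = 0; i < sampleSize; i++) {
        String v = values.get(rng.nextInt(size));
        if (v != null && matches.test(v)) count++;
      }
    }
    return count;
  }
}
```

Each metric would then reduce to a one-line delegate, for example passing the untrimmed check or the non-trivial whitespace check as `matches`.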

marthasharkey added 17 commits December 17, 2024 10:05

working with new line

79ff2f5

update tests

975f6db

extend whitespace to check

1263900

Merge branch 'develop' into wip/mk/dq-metric

880b442

add sample test back and sample new dqm

20e3cdf

test passing unsampled

3b9ed52

working with sampling

848d733

simplify method

2b08006

Merge branch 'develop' into wip/mk/dq-metric

3dd76d6

reformatjava files

b6a032f

reformat Java code

de6d76b

reformat java files

7062247

update comments

75a5ac3

update comment

d38a3ca

Merge branch 'develop' into wip/mk/dq-metric

11f635f

Merge branch 'wip/mk/dq-metric' of https://github.com/enso-org/enso i…

991faea

…nto wip/mk/dq-metric

Merge branch 'develop' into wip/mk/dq-metric

3d7d6b0

marthasharkey marked this pull request as ready for review January 8, 2025 13:35

marthasharkey requested review from jdunkerley, radeusgd, GregoryTravis and AdRiley as code owners January 8, 2025 13:35

fix tests

02ae856

jdunkerley requested changes Jan 8, 2025

GregoryTravis reviewed Jan 8, 2025

marthasharkey added 5 commits January 8, 2025 16:56

simplift enso viz code

7b0b709

rename

fa48936

add sampleoperation class

159e387

use record and one computable future

987ecc6

update functions and run formatter

eb045bb

marthasharkey added 2 commits January 9, 2025 14:08

ignore regular space for ws metric

ac4afdb

pull out is_text or mixed check

b74c31b

marthasharkey commented Jan 9, 2025

run javafmtAll

90c7303

radeusgd reviewed Jan 9, 2025

revert change

edba289

GregoryTravis reviewed Jan 15, 2025

GregoryTravis approved these changes Jan 15, 2025

marthasharkey and others added 2 commits January 16, 2025 13:34

Update distribution/lib/Standard/Table/0.0.0-dev/src/Column.enso

d8d01a4

Co-authored-by: Gregory Michael Travis <[email protected]>

Merge branch 'develop' into wip/mk/dq-metric

eed7745

jdunkerley approved these changes Jan 16, 2025

marthasharkey added 5 commits January 16, 2025 14:20

rename and pull out to seperate function

675f766

run javafmtAll

694dbb1

Merge branch 'develop' into wip/mk/dq-metric

c16b796

Merge branch 'develop' into wip/mk/dq-metric

c7b1b37

Merge branch 'develop' into wip/mk/dq-metric

22694d1

marthasharkey added the CI: No changelog needed Do not require a changelog entry for this PR. label Jan 24, 2025

marthasharkey added 5 commits January 28, 2025 17:32

Merge branch 'develop' into wip/mk/dq-metric

2ee675f

fix merge

917354d

fix tests and merge

4607ce6

run java formatter

1c45bd0

Merge branch 'develop' into wip/mk/dq-metric

45fb321

marthasharkey added the CI: Ready to merge This PR is eligible for automatic merge label Jan 30, 2025

mergify bot merged commit 96b3a97 into develop Jan 30, 2025
45 checks passed

mergify bot deleted the wip/mk/dq-metric branch January 30, 2025 09:54
