
lineage fixes: numpy, DataFrame, specs #44

Draft
wants to merge 14 commits into base: main
Conversation

@vidma (Contributor) commented Sep 21, 2021

  • don't report full ndarray schemas, e.g. array(int, int, int, int, ... <50000 times more>)
  • improve lineage:
    • fix the old add_dependencies to use the newer methods underneath
    • report all-to-all lineage where appropriate
    • if we have a list[items], report only the first item
    • if we have a list[list[items]], report only the first item at each nesting level! e.g. list(ndarray[ndarray[..]])
  • add specs for pandas+numpy
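The nested-schema truncation described in the bullets above can be sketched as follows. This is a hypothetical, simplified helper (not the actual kensu-py implementation) that keeps only the first element at each nesting level, so a 50000-element array no longer produces a 50000-field schema:

```python
def truncate_schema(value):
    """Return a compact schema string for a possibly nested list,
    keeping only the first element at each nesting level.
    Illustrative sketch only; kensu-py's real schema reporting differs."""
    if isinstance(value, (list, tuple)):
        if not value:
            return "list()"
        # recurse into the first element only, ignoring the rest
        return "list({})".format(truncate_schema(value[0]))
    return type(value).__name__

print(truncate_schema([[1, 2, 3], [4, 5]]))  # list(list(int))
print(truncate_schema([1.0] * 50000))        # list(float)
```

The same idea extends to ndarrays: instead of enumerating every element's type, recurse into element [0] at each dimension and report a single nested type.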

Comment on lines +49 to +71
def assert_msg_exists(self, msg, msg2=None):
    with open(self.offline_file, "r") as f:
        # pass if any line contains msg (and msg2, when given)
        assert any(msg in line and (msg2 is None or msg2 in line)
                   for line in f.readlines())

def assert_ndarray(self, v):
    assert str(type(v)) == "<class 'kensu.numpy.ndarray'>"

def test_df_indexed_value(self):
    out_fname1 = 'test_df_column.indexer.values'
    out_fname = 'test_df_indexed_list_of_ndarray'
    # <class 'kensu.pandas.data_frame.DataFrame'>
    dummy_ts_df = pd.read_csv('https://raw.githubusercontent.com/numenta/NAB/master/data/realTweets/Twitter_volume_AMZN.csv', index_col=0)  # header=0,

    l1 = []
    # here `value` is a column name!
    value_series = dummy_ts_df.value[:"2015-04-05 00:00:00"]  # <class 'kensu.pandas.data_frame.Series'>
    # probably not supporting passing list(Series) to constructor of ndarray?
    v = value_series.values  # ndarray
    self.assert_ndarray(v)
    ndarray_to_csv(v, out_fname1)
    self.assert_msg_exists('Lineage to kensu-py/test_df_column.indexer.values from realTweets/Twitter_volume_AMZN.csv',
                           '"columnDataDependencies": {"0": ["value"]}}')

vidma (Contributor, Author) commented:

example of the integration test "pattern":

  • the kensu-py collector writes to the offline ingestion jsonl file as usual, with report_to_file=True
  • the test helper assert_msg_exists reads the offline ingestion jsonl file
  • we check for certain patterns in the jsonl file (here just whether a certain line or substring exists, which seems enough for now; in spark-collector it's fancier: parse and transform the jsonl)
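A minimal self-contained sketch of this test pattern, covering both the substring check used here and the "fancier" parse-the-jsonl variant mentioned for spark-collector. The file contents and field names below are fabricated for illustration; the real kensu offline ingestion format may differ:

```python
import json
import os
import tempfile

def assert_msg_exists(offline_file, msg, msg2=None):
    # substring check, as in the test helper above
    with open(offline_file, "r") as f:
        assert any(msg in line and (msg2 is None or msg2 in line) for line in f)

def parsed_entries(offline_file):
    # the fancier variant: parse each jsonl line into a dict for structured asserts
    with open(offline_file, "r") as f:
        return [json.loads(line) for line in f if line.strip()]

# usage with a fabricated one-line jsonl file
fd, path = tempfile.mkstemp(suffix=".jsonl")
with os.fdopen(fd, "w") as f:
    f.write(json.dumps({"columnDataDependencies": {"0": ["value"]}}) + "\n")

assert_msg_exists(path, '"columnDataDependencies"', '"value"')
entries = parsed_entries(path)
assert entries[0]["columnDataDependencies"]["0"] == ["value"]
os.remove(path)
```

The substring check keeps tests resilient to field ordering in the serialized jsonl, while the parsed variant allows precise assertions on nested structure.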
