
lineage fixes: numpy, DataFrame, specs #44

Draft
wants to merge 14 commits into base: main
Conversation

@vidma (Contributor) commented Sep 21, 2021

  • don't report full ndarray schemas, e.g. array(int, int, int, int, ... <50000 times more>)
  • improve lineage:
    • fix the old add_dependencies to use the newer methods underneath
    • report all-to-all lineage where appropriate
    • if we have a list[items], report only the first item
    • if we have a list[list[items]], report only the first item at each nesting level! e.g. list(ndarray[ndarray[..]])
  • add specs for pandas+numpy
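The nested-schema truncation described in the bullets above can be sketched as follows. This is a hypothetical, simplified helper (not the actual kensu-py implementation) that keeps only the first element at each nesting level, so a 50000-element array no longer produces a 50000-field schema:

```python
def truncate_schema(value):
    """Return a compact schema string for a possibly nested list,
    keeping only the first element at each nesting level.
    Illustrative sketch only; kensu-py's real schema reporting differs."""
    if isinstance(value, (list, tuple)):
        if not value:
            return "list()"
        # recurse into the first element only, ignoring the rest
        return "list({})".format(truncate_schema(value[0]))
    return type(value).__name__

print(truncate_schema([[1, 2, 3], [4, 5]]))  # list(list(int))
print(truncate_schema([1.0] * 50000))        # list(float)
```

The same idea extends to ndarrays: instead of enumerating every element's type, recurse into element [0] at each dimension and report a single nested type.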

Comment on lines +49 to +71
def assert_msg_exists(self, msg, msg2=None):
    with open(self.offline_file, "r") as f:
        # pass if any line contains msg (and msg2, when given)
        assert any(msg in line and (msg2 is None or msg2 in line)
                   for line in f.readlines())

def assert_ndarray(self, v):
    assert str(type(v)) == "<class 'kensu.numpy.ndarray'>"

def test_df_indexed_value(self):
    out_fname1 = 'test_df_column.indexer.values'
    out_fname = 'test_df_indexed_list_of_ndarray'
    # <class 'kensu.pandas.data_frame.DataFrame'>
    dummy_ts_df = pd.read_csv('https://raw.githubusercontent.com/numenta/NAB/master/data/realTweets/Twitter_volume_AMZN.csv', index_col=0)  # header=0,

    l1 = []
    # here `value` is a column name!
    value_series = dummy_ts_df.value[:"2015-04-05 00:00:00"]  # <class 'kensu.pandas.data_frame.Series'>
    # probably not supporting passing list(Series) to constructor of ndarray?
    v = value_series.values  # ndarray
    self.assert_ndarray(v)
    ndarray_to_csv(v, out_fname1)
    self.assert_msg_exists('Lineage to kensu-py/test_df_column.indexer.values from realTweets/Twitter_volume_AMZN.csv',
                           '"columnDataDependencies": {"0": ["value"]}}')

vidma (Contributor, Author) commented:

example of the integration test "pattern":

  • the kensu-py collector writes to the offline ingestion jsonl file as usual, with report_to_file=True
  • the test helper assert_msg_exists reads the offline ingestion jsonl file
  • we check for certain patterns in the jsonl file (here just whether a certain line or substring exists, which seems enough for now; in spark-collector it's fancier: parse and transform the jsonl)
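A minimal self-contained sketch of this test pattern, covering both the substring check used here and the "fancier" parse-the-jsonl variant mentioned for spark-collector. The file contents and field names below are fabricated for illustration; the real kensu offline ingestion format may differ:

```python
import json
import os
import tempfile

def assert_msg_exists(offline_file, msg, msg2=None):
    # substring check, as in the test helper above
    with open(offline_file, "r") as f:
        assert any(msg in line and (msg2 is None or msg2 in line) for line in f)

def parsed_entries(offline_file):
    # the fancier variant: parse each jsonl line into a dict for structured asserts
    with open(offline_file, "r") as f:
        return [json.loads(line) for line in f if line.strip()]

# usage with a fabricated one-line jsonl file
fd, path = tempfile.mkstemp(suffix=".jsonl")
with os.fdopen(fd, "w") as f:
    f.write(json.dumps({"columnDataDependencies": {"0": ["value"]}}) + "\n")

assert_msg_exists(path, '"columnDataDependencies"', '"value"')
entries = parsed_entries(path)
assert entries[0]["columnDataDependencies"]["0"] == ["value"]
os.remove(path)
```

The substring check keeps tests resilient to field ordering in the serialized jsonl, while the parsed variant allows precise assertions on nested structure.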
