Wrong Documentation "Run a Validation Definition" #11352

@HoergerL

Description

Describe the bug
I am completely new to Great Expectations, so it might be that I did something wrong on my side. I have been following the documentation exactly and arrived at this step:
https://docs.greatexpectations.io/docs/core/run_validations/run_a_validation_definition. I don't really understand what this part does:
batch_parameters_dataframe = {"dataframe": dataframe}
batch_parameters_daily = {"year": "2020", "month": "1", "day": "17"}
batch_parameters_yearly = {"year": "2019"}

validation_results = validation_definition.run(batch_parameters=batch_parameters_yearly)

but to me it doesn't look intended. batch_parameters_dataframe and batch_parameters_daily are not used at all, and if I execute the code like this, I get the following error: "BuildBatchRequestError: Bad input to build_batch_request: options must contain exactly 1 key, 'dataframe'.". I experimented and switched it to this, which ran successfully: validation_results = validation_definition.run(batch_parameters=batch_parameters)
Could you either explain better in the documentation what this is about and how to use it, or adjust it to runnable code? This makes it really hard for beginners to get started.
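
For context, here is the shape of the call that worked for me, a minimal sketch assuming a whole-dataframe Batch Definition over a Spark DataFrame (names match the reproduction code below):

import great_expectations as gx

context = gx.get_context(mode="file")

# Retrieve the Validation Definition created earlier
validation_definition = context.validation_definitions.get("my_validation_definition")

# For a Spark DataFrame asset, batch_parameters must contain exactly one key: "dataframe"
dataframe = spark.sql(f"Select * from {db_name}.{table_name}")  # same DataFrame as in the code below
batch_parameters = {"dataframe": dataframe}

validation_results = validation_definition.run(batch_parameters=batch_parameters)
print(validation_results)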

To Reproduce
Please include your great_expectations.yml config, the code you’re executing that causes the issue, and the full stack trace of any error(s).

Code:

import great_expectations as gx

context = gx.get_context(mode="file")
print(type(context).__name__)
# Retrieve the dataframe Batch Definition


# data_source = context.data_sources.add_spark(name=data_source_name)
try:
    data_source = context.data_sources.get(data_source_name)
except:
    # Datasource doesn't exist yet; add it
    data_source = context.data_sources.add_spark(name=data_source_name)

# data_asset = data_source.add_dataframe_asset(name=data_asset_name)
try:
    data_asset = data_source.get_asset(data_asset_name)
except:
    data_asset = data_source.add_dataframe_asset(name=data_asset_name)

batch_definition_name = "my_batch_definition"

# Add a Batch Definition to the Data Asset
# created above; theoretically unnecessary here
try:
    batch_definition = data_asset.get_batch_definition(batch_definition_name)
except:
    batch_definition = data_asset.add_batch_definition_whole_dataframe(batch_definition_name) # means that the batch definition contains the complete dataframe and not parts of it


batch_definition = (
    context.data_sources.get(data_source_name)
    .get_asset(data_asset_name)
    .get_batch_definition(batch_definition_name)
)


dataframe = spark.sql(f"Select * from {db_name}.{table_name}")

print(type(dataframe))

# Create an Expectation to test
expectation = gx.expectations.ExpectColumnValuesToBeBetween(
    column="recid", max_value=6, min_value=1
)

# Get the dataframe as a Batch
batch_parameters = {"dataframe": dataframe}

batch = batch_definition.get_batch(batch_parameters=batch_parameters)

# print(batch.head())

# Test the Expectation
validation_results = batch.validate(expectation)
print(validation_results)

# organize expectations into an expectation suite
suite_name = "my_expectation_suite"
suite = gx.ExpectationSuite(name=suite_name)

try: 
    suite = context.suites.add(suite)
except:
    print("suite already exists")
existing_suite_name = (
    "my_expectation_suite"  # replace this with the name of your Expectation Suite
)
suite = context.suites.get(name=existing_suite_name)

suite.add_expectation(expectation)
expectation.column = "pickup_location_id"
expectation.save()

# create a validation definition

# Retrieve an Expectation Suite
# expectation_suite_name = "my_expectation_suite"
# expectation_suite = context.suites.get(name=expectation_suite_name)

# Retrieve a Batch Definition
# batch_definition_name = "my_batch_definition"
# batch_definition = (
#     context.data_sources.get(data_source_name)
#     .get_asset(data_asset_name)
#     .get_batch_definition(batch_definition_name)
# )

# Create a Validation Definition
definition_name = "my_validation_definition"
validation_definition = gx.ValidationDefinition(
    data=batch_definition, suite=suite, name=definition_name
)

# Add the Validation Definition to the Data Context
try:
    validation_definition = context.validation_definitions.add(validation_definition)
except:
    print("validation_definition already exists")


# retrieving expectation suite
# existing_suite_name = (
#     "my_expectation_suite"  # replace this with the name of your Expectation Suite
# )
# suite = context.suites.get(name=existing_suite_name)

import great_expectations as gx


# Retrieve the Validation Definition
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)

print(validation_definition)

# Define Batch parameters
# Accepted keys are determined by the BatchDefinition used to instantiate this ValidationDefinition.
batch_parameters_dataframe = {"dataframe": dataframe}
batch_parameters_daily = {"year": "2020", "month": "1", "day": "17"}
batch_parameters_yearly = {"year": "2019"}
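# NOTE (my understanding): with a Batch Definition created via
# add_batch_definition_whole_dataframe, only batch_parameters_dataframe is a valid
# input here; the daily/yearly variants would only apply to a date-partitioned
# Batch Definition, which is why the call below fails.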

# Run the Validation Definition
# validation_results = validation_definition.run(batch_parameters=batch_parameters_yearly)
validation_results = validation_definition.run(batch_parameters=batch_parameters_yearly)

# Review the Validation Results
# print(validation_results)

Error:

BuildBatchRequestError Traceback (most recent call last)
Cell In[157], line 19
14 batch_parameters_yearly = {"year": "2019"}
16 # Run the Validation Definition
17 # validation_results = validation_definition.run(batch_parameters=batch_parameters_yearly)
18 #TODO: make runnable
---> 19 validation_results = validation_definition.run(batch_parameters=batch_parameters_yearly)
21 # Review the Validation Results
22 # print(validation_results)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/great_expectations/core/validation_definition.py:296, in ValidationDefinition.run(self, checkpoint_id, batch_parameters, expectation_parameters, result_format, run_id)
289 diagnostics.raise_for_error()
291 validator = Validator(
292 batch_definition=self.batch_definition,
293 batch_parameters=batch_parameters,
294 result_format=result_format,
295 )
--> 296 results = validator.validate_expectation_suite(self.suite, expectation_parameters)
297 results.meta["validation_id"] = self.id
298 results.meta["checkpoint_id"] = checkpoint_id

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/great_expectations/validator/v1_validator.py:71, in Validator.validate_expectation_suite(self, expectation_suite, expectation_parameters)
65 def validate_expectation_suite(
66 self,
67 expectation_suite: ExpectationSuite,
68 expectation_parameters: Optional[SuiteParameterDict] = None,
69 ) -> ExpectationSuiteValidationResult:
70 """Run an expectation suite against the batch definition"""
---> 71 results = self._validate_expectation_configs(
72 expectation_configs=expectation_suite.expectation_configurations,
73 expectation_parameters=expectation_parameters,
74 )
75 statistics = calc_validation_statistics(results)
77 return ExpectationSuiteValidationResult(
78 results=results,
79 success=statistics.success,
(...)
99 batch_id=self.active_batch_id,
100 )

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/great_expectations/validator/v1_validator.py:123, in Validator._validate_expectation_configs(self, expectation_configs, expectation_parameters)
117 def _validate_expectation_configs(
118 self,
119 expectation_configs: list[ExpectationConfiguration],
120 expectation_parameters: Optional[SuiteParameterDict] = None,
121 ) -> list[ExpectationValidationResult]:
122 """Run a list of expectation configurations against the batch definition"""
--> 123 processed_expectation_configs = self._wrapped_validator.process_expectations_for_validation(
124 expectation_configs, expectation_parameters
125 )
127 runtime_configuration: dict
128 if isinstance(self.result_format, ResultFormat):

File ~/cluster-env/clonedenv/lib/python3.10/functools.py:981, in cached_property.get(self, instance, owner)
979 val = cache.get(self.attrname, _NOT_FOUND)
980 if val is _NOT_FOUND:
--> 981 val = self.func(instance)
982 try:
983 cache[self.attrname] = val

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/great_expectations/validator/v1_validator.py:112, in Validator._wrapped_validator(self)
110 @cached_property
111 def _wrapped_validator(self) -> OldValidator:
--> 112 batch_request = self._batch_definition.build_batch_request(
113 batch_parameters=self._batch_parameters
114 )
115 return self._get_validator(batch_request=batch_request)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/great_expectations/core/batch_definition.py:67, in BatchDefinition.build_batch_request(self, batch_parameters)
63 def build_batch_request(
64 self, batch_parameters: Optional[BatchParameters] = None
65 ) -> BatchRequest[PartitionerT]:
66 """Build a BatchRequest from the asset and batch parameters."""
---> 67 return self.data_asset.build_batch_request(
68 options=batch_parameters,
69 partitioner=self.partitioner,
70 )

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/great_expectations/datasource/fluent/spark_datasource.py:238, in DataFrameAsset.build_batch_request(self, options, batch_slice, partitioner)
232 raise BuildBatchRequestError(
233 message="partitioner is not currently supported for this DataAsset "
234 "and must be None."
235 )
237 if not (options is not None and "dataframe" in options and len(options) == 1):
--> 238 raise BuildBatchRequestError(message="options must contain exactly 1 key, 'dataframe'.")
240 if not self.is_spark_data_frame(options["dataframe"]):
241 raise BuildBatchRequestError(
242 message="Cannot build batch request without a Spark DataFrame."
243 )

BuildBatchRequestError: Bad input to build_batch_request: options must contain exactly 1 key, 'dataframe'.

Expected behavior
I would expect the code to run through and execute the Validation Definition.

Environment (please complete the following information):

  • Operating System: Windows
  • Great Expectations Version: 1.5.8
  • Data Source: Spark
  • Cloud environment: Azure Synapse

