Skip to content

Instrument strchr to use the string refinement solver #5200

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from

Conversation

xbauch
Copy link
Contributor

@xbauch xbauch commented Dec 5, 2019

Reusing and extending the Java-based indexOf. The string refinement assumes that certain associations are made between the char-pointer input, (some) array string (, and it's length). There are also assumption about exactly what kinds of expressions can be passed to the string solver representing string. We added some special handling for byte-extract expressions and string-constants. These modification should not change the java part.

strchr choice is just for demonstration, the intent is to implement all/most of the string.h functions.

  • Each commit message has a non-empty body, explaining why the change was made.
  • Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
  • The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
  • Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
  • My commit message includes data points confirming performance improvements (if claimed).
  • My PR is restricted to a single feature or bugfix.
  • White-space or formatting changes outside the feature-related changed lines are in commits of their own.

Sorry, something went wrong.

int main()
{
int i;
char *test = i ? "fo\0tis" : "notfoti";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a test for the case when a character can’t be found as well, e.g. in this case 't' may or may not be found depending on i.

@thk123
Copy link
Contributor

thk123 commented Jan 30, 2020

Note, this currently fails CI on Travis on the following test:

Running string-refinement-strchr2/test.desc  [FAILED]

Specifically fails to find:

^\[main.assertion.\d+\] line \d+ assertion first_o != NULL: SUCCESS$

Instead:

[main.assertion.1] line 10 assertion first_o != ((void *)0): SUCCESS

I guess different platform, different way of printing null?

For handling byte-extract expression, string-constants etc., in places that
expect string arrays. Also slightly extend the definition of char-type.
into a new c-specific axiom-generating function. Basically, the addition encodes
the existence of terminating-zero and the properties this entails.
by
1: associating the `src` with a new infinite array,
2: associating the array with its length variable,
3: applying the c_index_of function
4: translating the returned index to the expected char-pointer/NULL.
extracting string arrays from byte-expressions and string constants, hiding the
`symex::args` renaming, etc.
using the string-refinement solver.
from offset 0, i.e. the whole array should be the result.
@xbauch xbauch force-pushed the feature/c-string-solver branch from d6f02c0 to 0879ad0 Compare February 7, 2020 13:24
@xbauch xbauch marked this pull request as ready for review February 7, 2020 13:25
Copy link
Contributor

@thk123 thk123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat! I have a number of suggestions of ways we can make the quite dense string solver code a bit more readable, happy to help with applying them.

I would like to see the coverage report as there ends up being lots of new code, and I don't really know when it would be triggered, so would be good to check we're testing it all. Hopefully will just appear, but maybe needs a rebase?

STRING_REFINEMENT_MAX_CHAR_WIDTH;

if(type.id() == ID_signedbv)
return to_signedbv_type(type).get_width() <= 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 Magic number here should have some explanation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documentation of the function needs to be updated, and does it make sense to have signed characters?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romainbrenguier char is allowed to be either signed or unsigned in C.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out the magic is not needed at all.


const auto &byte_extract_expr = to_byte_extract_expr(expr);
const auto &offset = byte_extract_expr.offset();
PRECONDITION(offset.id() == ID_constant);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❔ how come this is a precondition? Seems like should be documented at least

const auto &byte_extract_expr = to_byte_extract_expr(expr);
const auto &offset = byte_extract_expr.offset();
PRECONDITION(offset.id() == ID_constant);
if(offset.id() != ID_constant)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ Unreachable? You've just asserted this isn't true

@@ -192,3 +200,68 @@ array_exprt interval_sparse_arrayt::concretize(
size, it == entries.end() ? default_value : it->second);
return array;
}

exprt maybe_byte_extract_expr(const exprt &expr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ What's "maybe" about this? Would returning an optional make sense here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it always returns some expr. But it acts as identity for expressions that are not byte-extract.


exprt maybe_remove_string_exprs(const exprt &expr)
{
return [&]() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ What's the purpose of this lambda...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fairly sure there isn’t one. Probably happened by accident during a refactor.

assertion->source_location.set_property_class("string");
assertion->source_location.set_comment(
"zero-termination of string argument of strchr");
auto new_aux_symbol =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ Don't you have a function that does this?

@@ -41,16 +41,19 @@ class incorrect_source_program_exceptiont : public cprover_exception_baset
void string_instrumentation(
symbol_tablet &,
message_handlert &,
goto_programt &);
goto_programt &,
size_t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛏️ naming variables in the header (particularly ones where the type tells you nothing) helps users of the function to figure out what they should be passing in

@@ -7,6 +7,8 @@ Author: Romain Brenguier, romain.brenguier@diffblue.com
\*******************************************************************/

#include "array_pool.h"
#include "string_refinement_util.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛏️ etc. in the commit message is less than ideal, perhaps it should be multiple commits, but should almost certainly list all the changes

if(
rhs.id() == ID_symbol &&
has_prefix(
id2string(to_symbol_expr(rhs).get_identifier()), "symex::args::"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard coded string seems like a bad idea, is it defined anywhere?

@@ -632,6 +632,17 @@ optionalt<exprt> get_subexpression_at_offset(
return typecast_exprt(expr, target_type_raw);
}

if(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 Please add why you need to do this (particularly in light of the comment yesterday about perf cost of byte_extract

attempt_assign_length_from_type(
to_array_string_expr(if_expr.false_case()),
length_of_array,
symbol_generator);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that doesn't seem correct, the length could be assigned twice with different value, while it only has to be equal to one of these depending on the condition

STRING_REFINEMENT_MAX_CHAR_WIDTH;

if(type.id() == ID_signedbv)
return to_signedbv_type(type).get_width() <= 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documentation of the function needs to be updated, and does it make sense to have signed characters?

@majakusber
Copy link

majakusber commented Feb 7, 2020

Codecov Report

Merging #5200 into develop will increase coverage by 0.07%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #5200      +/-   ##
===========================================
+ Coverage    67.43%   67.51%   +0.07%     
===========================================
  Files         1157     1157              
  Lines        95354    95536     +182     
===========================================
+ Hits         64306    64504     +198     
+ Misses       31048    31032      -16     
Flag Coverage Δ
#cproversmt2 42.55% <2.97%> (-0.13%) ⬇️
#regression 64.05% <87.12%> (+0.09%) ⬆️
#unit 31.90% <13.36%> (-0.05%) ⬇️
Impacted Files Coverage Δ
src/goto-programs/string_instrumentation.cpp 28.49% <0.00%> (+28.49%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cccdfd9...0879ad0. Read the comment docs.

@thk123
Copy link
Contributor

thk123 commented Feb 7, 2020

Looks like two big areas not currently exercised:

  1. maybe_byte_extract_expr never gets a non-zero offset (here)
  2. Any of the new code for string_instrumentationt::do_strchr (here)

Not sure exactly what they'd require (possibly the 2nd one is just the code isn't needed?)

Copy link
Collaborator

@martin-cs martin-cs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not the expert on the string solver so perhaps it is not my place to say but this doesn't feel like a particularly good design for supporting a large number of string functions. My experience suggests that you want a minimal number of (ideally orthogonal) primitives that are supported by the solver and the rest of the complexity and modelling can be in the library models. That is a lot easier to edit, test (given an executable form of the primitives), understand and modify.

if(options.get_bool_option("string-abstraction"))
string_instrumentation(goto_model, log.get_message_handler());
if(
options.get_bool_option("string-abstraction") ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am getting mixed up but I thought the "string-abstraction" options did something very different? Maybe that was in goto-instrument.

@romainbrenguier
Copy link
Contributor

@martin-cs

I'm not the expert on the string solver so perhaps it is not my place to say but this doesn't feel like a particularly good design for supporting a large number of string functions. My experience suggests that you want a minimal number of (ideally orthogonal) primitives that are supported by the solver and the rest of the complexity and modelling can be in the library models. That is a lot easier to edit, test (given an executable form of the primitives), understand and modify.

Yes, I agree it would be better to avoid adding new primitives but rather reuse the existing ones and use models to deal with the differences between languages.

@thk123
Copy link
Contributor

thk123 commented Feb 11, 2020

When trying this branch out, I noticed the following example on this branch causes an invariant to be hit:

#include <assert.h>
#include <string.h>

void testFunction() {
  int size;
  __CPROVER_assume(size >= 10 && size <= 100);
  char *str = malloc(sizeof(char) * size);
  for(int i = 0; i < size-1; ++i)
  {
    char nondet_char;
    str[i] = nondet_char;
  }
  str[size - 1] = '\0';

  char* result = strchr(str, 'z');
  assert(result - str != 15);
}

@xbauch xbauch mentioned this pull request Feb 21, 2020
7 tasks
@xbauch
Copy link
Contributor Author

xbauch commented Feb 21, 2020

@martin-cs and @romainbrenguier there's a draft PR (#5239) which implements strchr by modelling the functionality inside string.h library and then instrumenting it to call the index-of directly. Have a look if you think that approach would be preferable to separating Java and C implementation. Note that there are some incompatible assumptions about types of things (array-size-type, char-type, etc.) between Java and C, so some changes to the originally java-only primitives are necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants