Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions Strata/Languages/Laurel/LaurelToCoreTranslator.lean
Original file line number Diff line number Diff line change
Expand Up @@ -56,8 +56,10 @@ def translateType (model : SemanticModel) (ty : HighTypeMd) : LMonoTy :=
| .TSet elementType => Core.mapTy (translateType model elementType) LMonoTy.bool
| .TMap keyType valueType => Core.mapTy (translateType model keyType) (translateType model valueType)
| .UserDefined name =>
-- Composite types map to "Composite"; datatypes map to their own name
match name.uniqueId.bind model.refToDef.get? with
-- Composite types map to "Composite"; datatypes map to their own name.
-- "regex" is a Core builtin sort used by the regex prelude; pass it through.
if name.text == "regex" then .tcons "regex" []
Copy link
Contributor

@keyboardDrummer keyboardDrummer Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. Use .TCore "regex" instead of a .Userdefined "regex" type, so you don't need this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that comment is LLM nonsense; I'll just follow the convention of the other builtin Laurel types that map to Core types by DDM ops for them.

else match name.uniqueId.bind model.refToDef.get? with
| some (.compositeType _) => .tcons "Composite" []
| some (.datatypeDefinition dt) => .tcons dt.name.text []
| _ => .tcons "Composite" [] -- fallback for unresolved refs
Expand Down
153 changes: 84 additions & 69 deletions Strata/Languages/Python/PyFactory.lean
Original file line number Diff line number Diff line change
Expand Up @@ -16,90 +16,105 @@ public section
-------------------------------------------------------------------------------

/-
Candidate translation pass for Python `re` code:

## Python Code:

```
...
PATTERN = r"^[a-z0-9][a-z0-9.-]{1,3}[a-z0-9]$"
REGEX = re.compile(PATTERN) # default flags == 0
...
if not re.match(REGEX, name) then # default flags == 0
return False
...
```

## Corresponding Strata.Core:

```
procedure _main () {

var PATTERN : string = "^[a-z0-9][a-z0-9.-]{1,3}[a-z0-9]$";

var REGEX : regex;
var $__REGEX : Except Error regex := PyReCompile(PATTERN, 0)

if ExceptErrorRegex_isOK($__REGEX) then {
REGEX := ExceptErrorRegex_getOK($__REGEX);
} else if (Error_isUnimplemented(ExceptErrorRegex_getError($__REGEX)) then {
// Unsupported by Strata.
havoc REGEX;
} else {
//
// TODO: Implement a version of `assert` that takes an expression to be
// evaluated when the assertion fails. In this case, we'd display the
// (computed) error message in `ExceptErrorRegex_getError($__REGEX)`.
//
// E.g., `assert false (printOnFailure := ExceptErrorRegex_getError($__REGEX));`
//
assert false;
}
...

if not PyReMatch(REGEX, name, 0) then
return false
}
```

## Python regex verification — factory functions

These factory functions provide `concreteEval` implementations that call
`pythonRegexToCore` to translate Python regex pattern strings into SMT-LIB
regex expressions at verification time.

### Architecture

Python's `re.compile` is a semantic no-op in the Laurel pipeline — it returns
the pattern string unchanged. The actual compilation to SMT-LIB regex happens
at the point of matching (`re.fullmatch`, `re.match`, `re.search`), where the
match mode is known.

This is important because `pythonRegexToCore` produces *different* regex
expressions depending on the mode -- in particular, it handles anchors (`^`/`$`)
by generating a union of anchor-activation variants with appropriate `.*`
wrapping (see `ReToCore.lean`). Compiling once at `re.compile` time would
require knowing the eventual match mode, which is not available.

### Factory functions

Each call to `re.fullmatch(pattern, s)` (and similarly `re.match`/`re.search`)
causes `pythonRegexToCore` to be called twice via `concreteEval`:

1. **`re_pattern_error(pattern)`** — Parses the pattern and returns
`NoError()` on success or `RePatternError(msg)` for a genuinely malformed
pattern. Unimplemented features return `NoError()` because Python would
succeed at runtime; the mode-specific factory staying uninterpreted provides
a sound over-approximation for those cases. The prelude checks this first
and returns `exception(err)` for pattern errors, modeling Python's
`re.error`.

2. **`re_fullmatch_str(pattern)`** (or `re_match_str`/`re_search_str`) —
Compiles the pattern with the correct `MatchMode`. Returns the compiled
regex on success, or `.none` on error (leaving the function uninterpreted).
Since pattern errors are already caught by `re_pattern_error`, the `.none`
case here only fires for unimplemented features, producing `unknown` VCs —
a sound over-approximation.

The double parse is defensible because `pythonRegexToCore` is fast enough -- it
runs at translation time, not solver time, and keeps the factory functions
orthogonal.
-/

open Core
open Lambda LTy.Syntax LExpr.SyntaxMono

def reCompileFunc : LFunc Core.CoreLParams :=
{ name := "PyReCompile",
-- Mode-specific regex compilation. Each function compiles a Python regex
-- string with the correct MatchMode so that anchors (^/$) are handled
-- properly.
private def mkModeCompileFunc (name : String) (mode : MatchMode) :
LFunc Core.CoreLParams :=
{ name := name,
typeArgs := [],
inputs := [("pattern", mty[string])],
output := mty[regex],
concreteEval := some
(fun _ args => match args with
| [LExpr.strConst () s] =>
let (expr, maybe_err) := pythonRegexToCore s mode
match maybe_err with
| none => .some expr
| some _ => .none
| _ => .none)
}

def reFullmatchStrFunc : LFunc Core.CoreLParams :=
mkModeCompileFunc "re_fullmatch_str" .fullmatch
def reMatchStrFunc : LFunc Core.CoreLParams :=
mkModeCompileFunc "re_match_str" .match
def reSearchStrFunc : LFunc Core.CoreLParams :=
mkModeCompileFunc "re_search_str" .search

def rePatternErrorFunc : LFunc Core.CoreLParams :=
{ name := "re_pattern_error",
typeArgs := [],
inputs := [("string", mty[string]),
("flags", mty[int])]
output := mty[ExceptErrorRegex],
inputs := [("pattern", mty[string])],
output := mty[Error],
concreteEval := some
(fun _ args => match args with
| [LExpr.strConst () s, LExpr.intConst () 0] =>
-- This function has a concrete evaluation implementation only when
-- flags == 0.
-- (FIXME): We use `.match` mode below because we support only
-- `re.match` for now. However, `re.compile` isn't mode-specific in
-- general.
let (expr, maybe_err) := pythonRegexToCore s .match
| [LExpr.strConst () s] =>
let (_, maybe_err) := pythonRegexToCore s .fullmatch
match maybe_err with
| none =>
-- Note: Do not use `eb` (in Strata Core Syntax) here (e.g., see below)
-- eb[(~ExceptErrorRegex_mkOK expr)]
-- that captures `expr` as an `.fvar`.
.some (LExpr.mkApp () (.op () "ExceptErrorRegex_mkOK" none) [expr])
| some (ParseError.unimplemented msg _pattern _pos) =>
.some (LExpr.mkApp () (.op () "ExceptErrorRegex_mkErr" none)
[LExpr.mkApp () (.op () "Error_Unimplemented" none) [.strConst () (toString msg)]])
| some (ParseError.patternError msg _pattern _pos) =>
.some (LExpr.mkApp () (.op () "ExceptErrorRegex_mkErr" none)
[LExpr.mkApp () (.op () "Error_RePatternErr" none) [.strConst () (toString msg)]])
.some (LExpr.mkApp () (.op () "NoError" (some mty[Error])) [])
| some (ParseError.unimplemented ..) =>
.some (LExpr.mkApp () (.op () "NoError" (some mty[Error])) [])
| some (ParseError.patternError msg ..) =>
.some (LExpr.mkApp () (.op () "RePatternError" (some mty[string → Error]))
[.strConst () (toString msg)])
| _ => .none)
}

def ReFactory : @Factory Core.CoreLParams :=
#[
reCompileFunc
reFullmatchStrFunc,
reMatchStrFunc,
reSearchStrFunc,
rePatternErrorFunc
]

end -- public section
Expand Down
94 changes: 93 additions & 1 deletion Strata/Languages/Python/PythonLaurelCorePrelude.lean
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,8 @@ datatype Error () {
AssertionError (Assertion_msg : string),
UnimplementedError (Unimplement_msg : string),
UndefinedError (Undefined_msg : string),
IndexError (IndexError_msg : string)
IndexError (IndexError_msg : string),
RePatternError (Re_msg : string)
};

// /////////////////////////////////////////////////////////////////////////////////////
Expand Down Expand Up @@ -93,12 +94,103 @@ datatype DictStrAny () {
DictStrAny_cons (key: string, val: Any, tail: DictStrAny)
};

// Forward declarations: needed so the inline functions after CoreOnlyDelimiter
// can reference these during DDM parsing. Filtered out at merge (precede the
// sentinel). The real definitions with concreteEval are supplied by ReFactory
// at verification time.
function re_fullmatch_str(pattern : string) : regex;
function re_match_str(pattern : string) : regex;
function re_search_str(pattern : string) : regex;
function re_pattern_error(pattern : string) : Error;

type CoreOnlyDelimiter;

// =====================================================================
// Core-only declarations (not expressed in Laurel)
// =====================================================================

// /////////////////////////////////////////////////////////////////////////////////////
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried putting all of this in the Laurel prelude instead of the Core one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial few functions may have to stay in Core because they call the SMTLIB regex functions, but the functions that call them might be movable to Laurel. Let me look into that...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Managed to move them to the Laurel prelude, but getting all asserts with regexes going from pass -> unknown b/c those functions are no longer marked inline. Any existing way to do that in Laurel?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any existing way to do that in Laurel?

Not right now. I think I will change it so Laurel will just mark everything inline when using the Python front-end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it - is that something you can do on Monday (tomorrow)? Otherwise I could sneak that into this PR

// Regex support
//
// Python signatures:
// re.compile(pattern: str) -> re.Pattern
// re.match(pattern: str | re.Pattern, string: str) -> re.Match | None
// re.search(pattern: str | re.Pattern, string: str) -> re.Match | None
// re.fullmatch(pattern: str | re.Pattern, string: str) -> re.Match | None
//
// Architecture:
//
// re.compile is a semantic no-op — it returns the pattern string unchanged.
// The mode-specific factory functions re_fullmatch_str, re_match_str,
// re_search_str each compile a pattern string to a regex with the correct
// MatchMode (via pythonRegexToCore), so anchors (^/$) are handled properly.
// Their concreteEval fires when the pattern is a string literal.
//
// The _bool helpers call the mode-specific factories, so there is a single
// source of truth for mode-specific compilation.
//
// On match, we return a from_ClassInstance wrapping a concrete re_Match
// with pos=0 and endpos=str.len(s), which is sound for the module-level
// API (no pos/endpos parameters).
// /////////////////////////////////////////////////////////////////////////////////////

// Mode-specific factory functions are declared via ReFactory (with concreteEval
// for literal pattern expansion), not in this prelude, to avoid duplicate
// definitions.

inline function re_fullmatch_bool(pattern : string, s : string) : bool {
str.in.re(s, re_fullmatch_str(pattern))
}
inline function re_match_bool(pattern : string, s : string) : bool {
str.in.re(s, re_match_str(pattern))
}
inline function re_search_bool(pattern : string, s : string) : bool {
str.in.re(s, re_search_str(pattern))
}

inline function mk_re_Match(s : string) : Any {
from_ClassInstance("re_Match",
DictStrAny_cons("re_match_string", from_string(s),
DictStrAny_cons("re_match_pos", from_int(0),
DictStrAny_cons("re_match_endpos", from_int(str.len(s)),
DictStrAny_empty()))))
}

// re.compile is a no-op: returns the pattern string unchanged.
inline function re_compile(pattern : Any) : Any
requires Any..isfrom_string(pattern);
{
pattern
}

inline function re_fullmatch(pattern : Any, s : Any) : Any
requires Any..isfrom_string(pattern) && Any..isfrom_string(s);
{
if Error..isRePatternError(re_pattern_error(Any..as_string!(pattern)))
then exception(re_pattern_error(Any..as_string!(pattern)))
else if re_fullmatch_bool(Any..as_string!(pattern), Any..as_string!(s))
then mk_re_Match(Any..as_string!(s))
else from_none()
}
inline function re_match(pattern : Any, s : Any) : Any
requires Any..isfrom_string(pattern) && Any..isfrom_string(s);
{
if Error..isRePatternError(re_pattern_error(Any..as_string!(pattern)))
then exception(re_pattern_error(Any..as_string!(pattern)))
else if re_match_bool(Any..as_string!(pattern), Any..as_string!(s))
then mk_re_Match(Any..as_string!(s))
else from_none()
}
inline function re_search(pattern : Any, s : Any) : Any
requires Any..isfrom_string(pattern) && Any..isfrom_string(s);
{
if Error..isRePatternError(re_pattern_error(Any..as_string!(pattern)))
then exception(re_pattern_error(Any..as_string!(pattern)))
else if re_search_bool(Any..as_string!(pattern), Any..as_string!(s))
then mk_re_Match(Any..as_string!(s))
else from_none()
}

// /////////////////////////////////////////////////////////////////////////////////////
//Functions that we provide to Python user
//to write assertions/contracts about about types of variables
Expand Down
45 changes: 44 additions & 1 deletion Strata/Languages/Python/PythonRuntimeLaurelPart.lean
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,8 @@ datatype Error {
AssertionError (Assertion_msg : string),
UnimplementedError (Unimplement_msg : string),
UndefinedError (Undefined_msg : string),
IndexError (IndexError_msg : string)
IndexError (IndexError_msg : string),
RePatternError (Re_msg : string)
}

// /////////////////////////////////////////////////////////////////////////////////////
Expand Down Expand Up @@ -89,6 +90,48 @@ datatype DictStrAny {
DictStrAny_cons (key: string, val: Any, tail: DictStrAny)
}

// /////////////////////////////////////////////////////////////////////////////////////
// Regex support — re.Match
//
// Models Python's re.Match as a composite (reference) type following the
// module_Class naming convention (re_Match).
//
// The Python-through-Laurel pipeline is entirely Any-typed: all user
// variables and function inputs/outputs are wrapped in the Any datatype.
// Consequently, re_match/re_search/re_fullmatch return Any (from_none
// or from_ClassInstance wrapping a re_Match). If the pipeline ever
// moves to concrete types, these should return re_Match | None directly.
//
// Fields that can be determined from the call site are set concretely
// in the Core-only prelude. pos and endpos are sound as 0 / str.len
// for the module-level re.match/re.search/re.fullmatch API which does
// not accept pos/endpos arguments. If compiled-pattern method calls
// with explicit pos/endpos are supported later, those values must be
// threaded through.
//
// Methods that depend on capture groups (group, start, end, span,
// groups, lastindex, lastgroup) are uninterpreted because SMT-LIB's
// string theory has no capture group support. This is a sound
// over-approximation: the solver treats them as abstract, so
// verification results involving these will be inconclusive rather
// than unsound.

composite re_Match {
var re_match_string : string
var re_match_pos : int
var re_match_endpos : int
}

// re.Match methods — uninterpreted (capture groups are beyond SMT-LIB)
function re_Match_group (self : re_Match, n : int) : string;
function re_Match_start (self : re_Match, n : int) : int;
function re_Match_end (self : re_Match, n : int) : int;
function re_Match_span_start (self : re_Match, n : int) : int;
function re_Match_span_end (self : re_Match, n : int) : int;
function re_Match_lastindex (self : re_Match) : int;
function re_Match_lastgroup (self : re_Match) : string;
function re_Match_groups (self : re_Match) : ListStr;

datatype FIRST_END_MARKER { }

#end
Expand Down
Loading
Loading