Add --format-email to perform full validation on "email" and "idn-email" formats #460
Open
trzejos wants to merge 15 commits into python-jsonschema:main from trzejos:email-format
5f4c7c5 wip: email format options (trzejos)
e660e16 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
43bef30 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
4bfadc1 update ParseResult
7bd81f8 nested set fix
20f7209 update changelog
b3df29c line length
4bcc102 Apply suggestions from code review (trzejos)
07d7204 removed --format-email CLI option. Refactored RFC5321 validation
8bc19e1 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
0befdf4 add named match groups and acceptance tests
03686fd [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
aef3647 Update test_format_email.py (trzejos)
9d9a266 convert email acceptance tests to unit tests
31e2915 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,6 @@ | ||
from .iso8601_time import validate as validate_time | ||
from .rfc3339 import validate as validate_rfc3339 | ||
from .rfc5321 import validate as validate_rfc5321 | ||
from .rfc6531 import validate as validate_rfc6531 | ||
|
||
__all__ = ("validate_rfc3339", "validate_time") | ||
__all__ = ("validate_rfc3339", "validate_rfc5321", "validate_rfc6531", "validate_time") |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
import re | ||
|
||
# ([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+") | ||
# @ | ||
# ([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|\[[\t -Z^-~]*]) | ||
# | ||
# [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-] == Alphanumeric characters and most special characters except [ (),.:;<>@\[\]\t] | ||
# [a-zA-Z0-9 !#$%&'()*+,./:;<=>?@\[\]^_`{|}~\t-] == All printable characters except for " and \ | ||
# [\t -~] == All printable characters | ||
# [a-zA-Z0-9 !"#$%&'()*+,./:;<=>?@^_`{|}~\t-] == All printable characters except for the following characters []\ | ||
RFC5321_REGEX = re.compile( | ||
r""" | ||
^ | ||
(?P<local> | ||
[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)* | ||
| | ||
"(?:[a-zA-Z0-9 !#$%&'()*+,./:;<=>?@\[\]^_`{|}~\t-]|\\[\t -~])+" | ||
) | ||
@ | ||
(?P<domain> | ||
[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)* | ||
| | ||
\[[a-zA-Z0-9 !"#$%&'()*+,./:;<=>?@^_`{|}~\t-]*\] | ||
) | ||
$ | ||
""", | ||
re.VERBOSE | re.ASCII, | ||
) | ||
|
||
|
||
def validate(email_str: object) -> bool: | ||
    """Validate a string as a RFC5321 email address.""" | ||
    if not isinstance(email_str, str): | ||
        return False | ||
    match = RFC5321_REGEX.match(email_str) | ||
    if not match: | ||
        return False | ||
    local, domain = match.group("local", "domain") | ||
    # Local part of email address is limited to 64 octets | ||
    if len(local) > 64: | ||
        return False | ||
    # Domain names are limited to 253 octets | ||
    if len(domain) > 253: | ||
        return False | ||
    for domain_part in domain.split("."): | ||
        # DNS Labels are limited to 63 octets | ||
        if len(domain_part) > 63: | ||
            return False | ||
    return True | ||
|
||
|
||
if __name__ == "__main__": | ||
    import timeit | ||
|
||
    N = 100_000 | ||
    tests = (("basic", "[email protected]"),) | ||
|
||
    print("benchmarking") | ||
    for name, val in tests: | ||
        all_times = timeit.repeat( | ||
            f"validate({val!r})", globals=globals(), repeat=3, number=N | ||
        ) | ||
        print(f"{name} (valid={validate(val)}): {int(min(all_times) / N * 10**9)}ns") |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
import re | ||
|
||
RFC6531_REGEX = re.compile( | ||
r""" | ||
^ | ||
# local part | ||
( | ||
([0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u0080-\U0010FFFF]+(\.[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u0080-\U0010FFFF]+)*) | ||
| | ||
# quoted string | ||
"([\x20-\x21\x23-\x5B\x5D-\x7E\u0080-\U0010FFFF]|\\[\x20-\x7E])*" | ||
) | ||
@ | ||
# Domain/address | ||
( | ||
# Address literal | ||
(\[( | ||
# IPv4 | ||
(\d{1,3}(\.\d{1,3}){3}) | ||
| | ||
# IPv6 | ||
(IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){7}) | ||
| | ||
(IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?) | ||
| | ||
(IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){5}:\d{1,3}(\.\d{1,3}){3}) | ||
| | ||
(IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3}:)?\d{1,3}(\.\d{1,3}){3}) | ||
| | ||
# General address | ||
([a-z0-9-]*[a-z0-9]:[\x21-\x5A\x5E-\x7E]+) | ||
)\]) | ||
| | ||
# Domain | ||
((?!.{256,})(([0-9a-z\u0080-\U0010FFFF]([0-9a-z-\u0080-\U0010FFFF]*[0-9a-z\u0080-\U0010FFFF])?))(\.([0-9a-z\u0080-\U0010FFFF]([0-9a-z-\u0080-\U0010FFFF]*[0-9a-z\u0080-\U0010FFFF])?))*) | ||
) | ||
$ | ||
""", | ||
re.VERBOSE | re.UNICODE | re.IGNORECASE, | ||
) | ||
|
||
|
||
def validate(email_str: object) -> bool: | ||
    """Validate a string as a RFC6531 email address.""" | ||
    if not isinstance(email_str, str): | ||
        return False | ||
    # re.match returns a Match object or None; convert to the annotated bool | ||
    return RFC6531_REGEX.match(email_str) is not None | ||
|
||
|
||
if __name__ == "__main__": | ||
    import timeit | ||
|
||
    N = 100_000 | ||
    tests = (("basic", "[email protected]"),) | ||
|
||
    print("benchmarking") | ||
    for name, val in tests: | ||
        all_times = timeit.repeat( | ||
            f"validate({val!r})", globals=globals(), repeat=3, number=N | ||
        ) | ||
        print(f"{name} (valid={validate(val)}): {int(min(all_times) / N * 10**9)}ns") |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
import pytest | ||
|
||
from check_jsonschema.formats.implementations.rfc5321 import validate | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"emailstr", | ||
( | ||
r"[email protected]", | ||
r"[email protected]", | ||
r"[email protected]", | ||
r"[email protected]", | ||
r"[email protected]", | ||
r"[email protected]", | ||
r"name/[email protected]", | ||
r"admin@example", | ||
r"[email protected]", | ||
r'" "@example.org', | ||
r'"john..doe"@example.org', | ||
r"[email protected]", | ||
r'"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com', | ||
r"user%[email protected]", | ||
r"[email protected]", | ||
r"postmaster@[123.123.123.123]", | ||
r"postmaster@[IPv6:2001:0db8:85a3:0000:0000:8a2e:0370:7334]", | ||
r"_test@[IPv6:2001:0db8:85a3:0000:0000:8a2e:0370:7334]", | ||
), | ||
) | ||
def test_simple_positive_cases(emailstr): | ||
    assert validate(emailstr) | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"emailstr", | ||
( | ||
r"I❤️[email protected]", | ||
r"用户@例子.广告", | ||
r"ಬೆಂಬಲ@ಡೇಟಾಮೇಲ್.ಭಾರತ", | ||
r"अजय@डाटा.भारत", | ||
r"квіточка@пошта.укр", | ||
r"χρήστης@παράδειγμα.ελ", | ||
r"Dörte@Sörensen.example.com", | ||
r"коля@пример.рф", | ||
r"abc.example.com", | ||
r"a@b@[email protected]", | ||
r'a"b(c)d,e:f;g<h>i[j\k][email protected]', | ||
r'just"not"[email protected]', | ||
r'this is"not\[email protected]', | ||
r"this\ still\"not\\[email protected]", | ||
r"1234567890123456789012345678901234567890123456789012345678901234+x@example.com", | ||
r"i.like.underscores@but_they_are_not_allowed_in_this_part", | ||
r"trythis@123456789012345678901234567890123456789012345678901234567890123456.com", | ||
r"another@12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234.com", | ||
), | ||
) | ||
def test_simple_negative_case(emailstr): | ||
    assert not validate(emailstr) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
import pytest | ||
|
||
from check_jsonschema.formats.implementations.rfc6531 import validate | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"emailstr", | ||
( | ||
r"[email protected]", | ||
r"[email protected]", | ||
r"[email protected]", | ||
r"[email protected]", | ||
r"[email protected]", | ||
r"[email protected]", | ||
r"name/[email protected]", | ||
r"admin@example", | ||
r"[email protected]", | ||
r'" "@example.org', | ||
r'"john..doe"@example.org', | ||
r"[email protected]", | ||
r'"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com', | ||
r"user%[email protected]", | ||
r"[email protected]", | ||
r"postmaster@[123.123.123.123]", | ||
r"postmaster@[IPv6:2001:0db8:85a3:0000:0000:8a2e:0370:7334]", | ||
r"_test@[IPv6:2001:0db8:85a3:0000:0000:8a2e:0370:7334]", | ||
r"I❤️[email protected]", | ||
r"用户@例子.广告", | ||
r"ಬೆಂಬಲ@ಡೇಟಾಮೇಲ್.ಭಾರತ", | ||
r"अजय@डाटा.भारत", | ||
r"квіточка@пошта.укр", | ||
r"χρήστης@παράδειγμα.ελ", | ||
r"Dörte@Sörensen.example.com", | ||
r"коля@пример.рф", | ||
), | ||
) | ||
def test_simple_positive_cases(emailstr): | ||
    assert validate(emailstr) | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"emailstr", | ||
( | ||
r"abc.example.com", | ||
r"a@b@[email protected]", | ||
r'a"b(c)d,e:f;g<h>i[j\k][email protected]', | ||
r'just"not"[email protected]', | ||
r'this is"not\[email protected]', | ||
r"this\ still\"not\\[email protected]", | ||
r"1234567890123456789012345678901234567890123456789012345678901234+x@example.com", | ||
r"i.like.underscores@but_they_are_not_allowed_in_this_part", | ||
r"trythis@123456789012345678901234567890123456789012345678901234567890123456.com", | ||
r"another@12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234.com", | ||
), | ||
) | ||
def test_simple_negative_case(emailstr): | ||
    assert not validate(emailstr) |
I'm finding this regex a bit difficult to read. In particular, I'm seeing some unconventional range expressions in the character classes, like `/-9`, ` -Z`, and `^-~`.
These are valid, but they aren't the way character classes are typically written. Perhaps other people have an expert and intuitive knowledge of what `chr(ord(" ")+1)` will be, but I definitely don't.
Can these be rewritten such that the suite of characters matched is more obvious to a reader? For example, rather than `[/-9]`, I would much rather see `[/0-9]`. The fact that the lowercase letters are captured with `^-~` caught me particularly off-guard.
That's fair. I'm not the original author of these regexes either (links in PR description). I'll see what I can do about cleaning up those character classes though.
@sirosen
I had some time this week to revisit this, and reworked the simpler RFC5321 validation. I also added some length checks and tried validating it against the examples on this Wikipedia page: https://en.wikipedia.org/wiki/Email_address#Examples
It validated/invalidated as expected except for a couple of cases:
- `I❤️[email protected]` was incorrectly found invalid, likely because of UTF-8 (`idn-email` format)
- `i.like.underscores@but_they_are_not_allowed_in_this_part` was incorrectly found valid, because we allow underscores in the domain part of the regex.
Let me know if the regex is easier to understand and if you think we need to handle non-ASCII strings like that UTF-8 one.
I have not revisited the idn-email validator yet
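One possible way to handle the two outstanding cases, sketched here under assumptions rather than taken from the PR: check domain labels against the letters-digits-hyphen hostname form so that underscores are rejected, and detect non-ASCII input up front so it can be routed to the `idn-email` validator. The names `LDH_LABEL`, `is_ldh_domain`, and `needs_idn_email` are hypothetical.

```python
import re

# Hypothetical helper, not taken from the PR: a single hostname label in
# letters-digits-hyphen form, 1 to 63 characters, with no leading or
# trailing hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?", re.ASCII)


def is_ldh_domain(domain: str) -> bool:
    """Return True if every dot-separated label is a valid LDH label."""
    if not domain or len(domain) > 253:
        return False
    return all(LDH_LABEL.fullmatch(label) for label in domain.split("."))


def needs_idn_email(address: str) -> bool:
    """Return True if the address contains any non-ASCII character."""
    return not address.isascii()


print(is_ldh_domain("example.org"))
print(is_ldh_domain("but_they_are_not_allowed_in_this_part"))
print(needs_idn_email("用户@例子.广告"))
```

This sketch only covers plain domain names. Address literals such as `[123.123.123.123]` would still need the separate bracketed branch that the regex already has.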