Commit 9f15e48

Merge pull request #155 from ossf/regex_platforms
Show regex variations by platform
2 parents 3a01184 + 5b9663e commit 9f15e48

1 file changed

+90
-10
lines changed

secure_software_development_fundamentals.md

+90-10
@@ -1708,15 +1708,17 @@ The usual way to require a regex to match an entire input is to include *anchors
17081708

17091709
**ca[brt]**
17101710

1711-
In contrast, this regex will only match *exactly* the words “**cab**”, “**car**”, or “**cat**” in most regex implementations, because “**^**” means *“match the beginning”* and “**$**” means *“match the end”*:
1711+
In contrast, this regex will only match *exactly* the words “**cab**”, “**car**”, or “**cat**” in many regex implementations, because “**^**” often means *“match the beginning”* and “**$**” often means *“match the end”*:
17121712

17131713
**^ca[brt]$**
17141714

1715-
In some implementations (depending on the option), “**^**” may mean *“beginning of any line”* not *“beginning of the string”* - and you usually want *“beginning of the string”*. A similar thing can happen with “**$**”. From here on we will assume that “**^**” and “**$**” mean beginning and end of the entire string.
1715+
In some implementations (depending on the option), “**^**” may mean *“beginning of any line”* not *“beginning of the string”* - and you usually want *“beginning of the string”*. Similarly, “**$**” means *“end of the string”* in some implementations, but not in others. Since regular expression notations vary, it's important to know how your specific regex implementation behaves.
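As a minimal sketch of why this matters, the following uses Python's standard `re` module (picked only for illustration; other implementations behave differently, which is the point). The helper names are hypothetical and not part of the course text.

```python
import re

PATTERN = r"^ca[brt]$"

def matches_default(text: str) -> bool:
    """Search using Python's default flags."""
    return re.search(PATTERN, text) is not None

def matches_multiline(text: str) -> bool:
    """Search with re.MULTILINE, where ^ and $ apply at every line boundary."""
    return re.search(PATTERN, text, re.MULTILINE) is not None

# In Python, "$" also matches just before a trailing newline,
# so "cat\n" is accepted even without any option set.
print(matches_default("cat"))
print(matches_default("cat\n"))

# With the multiline option, a matching line anywhere in the data is accepted.
print(matches_default("dog\ncat\nfish"))
print(matches_multiline("dog\ncat\nfish"))
```

In this sketch the same pattern accepts or rejects the same input depending only on an option, so an input validator has to pin down which meaning of “**^**” and “**$**” is in effect.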
17161716

17171717
#### Know Your Regex Implementation
17181718

1719-
Almost every programming language has at least one good regex implementation. They all share many features, but many are slightly different. So, when you use a regex implementation you have not used before, look at its documentation every time you use an operation that you have not used before. Here are some variations to look for.
1719+
Regex notations are *not* the same across different languages and libraries. Almost every programming language has at least one good regex implementation, and they all share many features. However, they often differ in small but important ways.
1720+
1721+
So, when you use a regex implementation you have not used before, check its documentation each time you use an operation that is new to you. Also, be careful when reusing a pattern that was written for a different implementation, because the same symbols may behave differently. Here are some variations to look for.
17201722

17211723
There are three major families of regex language notations:
17221724

@@ -1728,19 +1730,97 @@ There are three major families of regex language notations:
17281730

17291731
Here are some important things that vary:
17301732

1731-
* Sometimes there is an option or alternative method to match the entire input; if available, you can use that instead of the anchoring symbols. Make sure it matches the whole thing, though; some methods only check the beginning.
1733+
1. Sometimes there is an option or alternative method to match the entire input; if available, you can use that instead of the anchoring symbols. Make sure it matches the whole thing, though; some methods only check the beginning.
1734+
1735+
2. In some implementations “**^**” matches the beginning of all the data, while in others it matches the beginning of any line in the data. This is often controlled by a *multiline* option.
17321736

1733-
* Sometimes “**^**” matches the beginning of the whole data, while in others it represents the beginning of any line in the data. The same goes for “**$**”. This is often controlled by a *multiline* option.
1737+
3. In some implementations “**$**” matches the end of all the data, while in others it matches the end of any line in the data. In some systems “**$**” also always matches just before an optional trailing newline character (or similar). In some systems you must use “**\z**” to match the end of the data, but in Python you must use “**\Z**”.
17341738

1735-
* The “**.**” for representing *“any character”* doesn’t always match the newline character (**\n**); often there is an option to turn this on or off.
1739+
4. The “**.**” for representing *“any character”* doesn’t always match the newline character (**\n**); often there is an option to turn this on or off.
17361740

1737-
* Does it properly support Unicode and the encoding you are using?
1741+
5. Some properly support Unicode and the encoding you are using; others do not.
17381742

1739-
* Can it handle data with the **NUL** character (byte value 0) within the data? If not, and your input data could have an embedded **NUL** character, you will need to validate the data first to make sure there are no **NUL** characters before passing the data to the regex implementation.
1743+
6. Some can handle data with the **NUL** character (byte value 0) within the data; others cannot. If yours cannot, and your input data could have an embedded **NUL** character, you will need to validate the data first to make sure there are no **NUL** characters before passing the data to the regex implementation.
17401744

1741-
* Is matching case-sensitive? Usually it is case-sensitive by default, and there is a trivial way to make it case-insensitive. If it is case-insensitive, remember that exactly what characters have case-insensitive matches depends on the locale. For example, “**I**” and “**i**” match in the English (“**en**”) and the C locale (“**C**”), but not in the Turkish (“**tr**”). In the Turkish locale, the Unicode LATIN CAPITAL LETTER I matches the LATIN SMALL LETTER DOTLESS I - not a lowercase “**i**”.
1745+
7. Most do case-sensitive matching by default, with a trivial way to make it case-insensitive; some do not. If matching is case-insensitive, remember that exactly which characters have case-insensitive matches depends on the locale. For example, “**I**” and “**i**” match in the English (“**en**”) and the C locale (“**C**”), but not in the Turkish (“**tr**”) locale. In the Turkish locale, the Unicode LATIN CAPITAL LETTER I matches the LATIN SMALL LETTER DOTLESS I - not a lowercase “**i**”.
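Several of the variations above can be sketched with Python's standard `re` module. This is only an illustration for one implementation, under the assumption that Python is the platform in use; the function names are hypothetical.

```python
import re

def starts_with_word(text: str) -> bool:
    """re.match only checks the beginning of the input (item 1)."""
    return re.match(r"ca[brt]", text) is not None

def is_exactly_word(text: str) -> bool:
    """re.fullmatch requires the whole input to match (item 1)."""
    return re.fullmatch(r"ca[brt]", text) is not None

def dot_matches(text: str, dotall: bool = False) -> bool:
    """'.' skips newline unless re.DOTALL is set (item 4)."""
    flags = re.DOTALL if dotall else 0
    return re.fullmatch(r"a.b", text, flags) is not None

def is_word_ignoring_case(text: str) -> bool:
    """Matching is case-sensitive unless re.IGNORECASE is set (item 7)."""
    return re.fullmatch(r"ca[brt]", text, re.IGNORECASE) is not None

print(starts_with_word("cats"), is_exactly_word("cats"))
print(dot_matches("a\nb"), dot_matches("a\nb", dotall=True))
print(is_exactly_word("CAT"), is_word_ignoring_case("CAT"))
```

Note that `re.match` accepts `"cats"` because it only checks the beginning of the input, which is exactly the trap the first item warns about.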
17421746

1743-
In some languages, such as in Ruby, you normally use **\A** and **\z** instead of “**^**” and “**$**” to match string begin/end, because “**^**” and “**$**” match line begin/end instead.
1747+
The following table shows how to create a regex pattern that matches an entire input string on some common platforms, as provided by [Correctly Using Regular Expressions for Secure Input Validation](https://best.openssf.org/Correctly-Using-Regular-Expressions). There's no need to memorize this; the point is to use the correct symbols for the platform you're using:

<table>
<tr>
<td>Platform</td>
<td>Prepend</td>
<td>Append</td>
</tr>
<tr>
<td>POSIX BRE, POSIX ERE, and ECMAScript (JavaScript)</td>
<td>“^” (not “\A”)</td>
<td>“$” (not “\z” nor “\Z”)</td>
</tr>
<tr>
<td>Perl, .NET/C#</td>
<td>“^” or “\A”</td>
<td>“\z” (not “$”)</td>
</tr>
<tr>
<td>Java</td>
<td>“^” or “\A”</td>
<td>“\z”; “$” works but some documents conflict</td>
</tr>
<tr>
<td>PHP</td>
<td>“^” or “\A”</td>
<td>“\z”; “$” with “D” modifier</td>
</tr>
<tr>
<td>PCRE</td>
<td>“^” or “\A”</td>
<td>“\z”; “$” with `PCRE2_DOLLAR_ENDONLY`</td>
</tr>
<tr>
<td>Golang, Rust crate regex, and RE2</td>
<td>“^” or “\A”</td>
<td>“\z” or “$”</td>
</tr>
<tr>
<td>Python</td>
<td>“^” or “\A”</td>
<td>“\Z” (not “$” nor “\z”)</td>
</tr>
<tr>
<td>Ruby</td>
<td>“\A” (not “^”)</td>
<td>“\z” (not “$”)</td>
</tr>
</table>
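As one possible way to apply the Python row of the table, the sketch below wraps a pattern with “\A” and “\Z” before use. The helper names are hypothetical, and the non-capturing group around the pattern is an added assumption (not from the table) so that the anchors apply to every alternative rather than only the first and last.

```python
import re

def anchor_for_python(pattern: str) -> str:
    """Wrap a pattern so it must match the entire input in Python's re.

    The (?:...) group keeps the anchors bound to the whole pattern
    even when it contains alternatives such as "cab|cat".
    """
    return r"\A(?:" + pattern + r")\Z"

def is_whole_match(pattern: str, text: str) -> bool:
    """Return True only if the anchored pattern matches all of the text."""
    return re.search(anchor_for_python(pattern), text) is not None

print(is_whole_match("ca[brt]", "cat"))    # whole input is one of the words
print(is_whole_match("ca[brt]", "cat\n"))  # trailing newline is rejected
print(is_whole_match("cab|cat", "cats"))   # extra trailing character is rejected
```

On other platforms the symbols to prepend and append would differ as shown in the table, so a helper like this should not be copied across implementations unchanged.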
17441824

17451825
#### Branch Priority
17461826
