Skip to content

Commit 3d7821c

Browse files
authored
Merge pull request #820 from markshannon/python-incomplete-url-sanitize
Python: Two new queries for URL and hostname sanitization (CWE-020).
2 parents d776d9f + 9adb19f commit 3d7821c

14 files changed

+390
-3
lines changed

change-notes/1.20/analysis-python.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,8 @@ Removes false positives seen when using Python 3.6, but not when using earlier v
1414
| **Query** | **Tags** | **Purpose** |
1515
|-----------------------------|-----------|--------------------------------------------------------------------|
1616
| Default version of SSL/TLS may be insecure (`py/insecure-default-protocol`) | security, external/cwe/cwe-327 | Finds instances where an insecure default protocol may be used. Results are shown on LGTM by default. |
17+
| Incomplete regular expression for hostnames (`py/incomplete-hostname-regexp`) | security, external/cwe/cwe-020 | Finds instances where a hostname is incompletely sanitized because a regular expression contains an unescaped character. Results are shown on LGTM by default. |
18+
| Incomplete URL substring sanitization (`py/incomplete-url-substring-sanitization`) | security, external/cwe/cwe-020 | Finds instances where a URL is incompletely sanitized due to insufficient checks. Results are shown on LGTM by default. |
1719
| Overly permissive file permissions (`py/overly-permissive-file`) | security, external/cwe/cwe-732 | Finds instances where a file is created with overly permissive permissions. Results are not shown on LGTM by default. |
1820
| Use of insecure SSL/TLS version (`py/insecure-protocol`) | security, external/cwe/cwe-327 | Finds instances where a known insecure protocol has been specified. Results are shown on LGTM by default. |
1921

javascript/ql/src/Security/CWE-020/IncompleteUrlSubstringSanitization.qhelp

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -15,9 +15,9 @@
1515

1616
<p>
1717

18-
However, it is notoriously error-prone to treat the URL as
19-
a string and check if one of the allowed hosts is a substring of the
20-
URL. Malicious URLs can bypass such security checks by embedding one
18+
However, treating the URL as a string and checking if one of the
19+
allowed hosts is a substring of the URL is very prone to errors.
20+
Malicious URLs can bypass such security checks by embedding one
2121
of the allowed hosts in an unexpected location.
2222

2323
</p>
Lines changed: 69 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,69 @@
1+
<!DOCTYPE qhelp PUBLIC
2+
"-//Semmle//qhelp//EN"
3+
"qhelp.dtd">
4+
<qhelp>
5+
6+
<overview>
7+
<p>
8+
9+
Sanitizing untrusted URLs is an important technique for
10+
preventing attacks such as request forgeries and malicious
11+
redirections. Often, this is done by checking that the host of a URL
12+
is in a set of allowed hosts.
13+
14+
</p>
15+
16+
<p>
17+
18+
If a regular expression implements such a check, it is
19+
easy to accidentally make the check too permissive by not escaping the
20+
<code>.</code> meta-characters appropriately.
21+
22+
Even if the check is not used in a security-critical
23+
context, the incomplete check may still cause undesirable behaviors
24+
when it accidentally succeeds.
25+
26+
</p>
27+
</overview>
28+
29+
<recommendation>
30+
<p>
31+
32+
Escape all meta-characters appropriately when constructing
33+
regular expressions for security checks, pay special attention to the
34+
<code>.</code> meta-character.
35+
36+
</p>
37+
</recommendation>
38+
39+
<example>
40+
41+
<p>
42+
43+
The following example code checks that a URL redirection
44+
will reach the <code>example.com</code> domain, or one of its
45+
subdomains.
46+
47+
</p>
48+
49+
<sample src="examples/IncompleteHostnameRegExp.py"/>
50+
51+
<p>
52+
The <code>unsafe</code> check is easy to bypass because the unescaped
53+
<code>.</code> allows for any character before
54+
<code>example.com</code>, effectively allowing the redirect to go to
55+
an attacker-controlled domain such as <code>wwwXexample.com</code>.
56+
57+
</p>
58+
<p>
59+
The <code>safe</code> check closes this vulnerability by escaping the <code>.</code>
60+
so that URLs of the form <code>wwwXexample.com</code> are rejected.
61+
</p>
62+
63+
</example>
64+
65+
<references>
66+
<li>OWASP: <a href="https://www.owasp.org/index.php/Server_Side_Request_Forgery">SSRF</a></li>
67+
<li>OWASP: <a href="https://www.owasp.org/index.php/Unvalidated_Redirects_and_Forwards_Cheat_Sheet">XSS Unvalidated Redirects and Forwards Cheat Sheet</a>.</li>
68+
</references>
69+
</qhelp>
Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
/**
2+
* @name Incomplete regular expression for hostnames
3+
* @description Matching a URL or hostname against a regular expression that contains an unescaped dot as part of the hostname might match more hostnames than expected.
4+
* @kind problem
5+
* @problem.severity warning
6+
* @precision high
7+
* @id py/incomplete-hostname-regexp
8+
* @tags correctness
9+
* security
10+
* external/cwe/cwe-20
11+
*/
12+
13+
import python
14+
import semmle.python.regex
15+
16+
private string commonTopLevelDomainRegex() {
17+
result = "com|org|edu|gov|uk|net|io"
18+
}
19+
20+
/**
21+
* Holds if `pattern` is a regular expression pattern for URLs with a host matched by `hostPart`,
22+
* and `pattern` contains a subtle mistake that allows it to match unexpected hosts.
23+
*/
24+
bindingset[pattern]
25+
predicate isIncompleteHostNameRegExpPattern(string pattern, string hostPart) {
26+
hostPart = pattern
27+
.regexpCapture("(?i).*" +
28+
// an unescaped single `.`
29+
"(?<!\\\\)[.]" +
30+
// immediately followed by a sequence of subdomains, perhaps with some regex characters mixed in, followed by a known TLD
31+
"([():|?a-z0-9-]+(\\\\)?[.](" + commonTopLevelDomainRegex() + "))" + ".*", 1)
32+
}
33+
34+
from Regex r, string pattern, string hostPart
35+
where
36+
(
37+
r.getText() = pattern
38+
) and
39+
isIncompleteHostNameRegExpPattern(pattern, hostPart) and
40+
// ignore patterns with capture groups after the TLD
41+
not pattern.regexpMatch("(?i).*[.](" + commonTopLevelDomainRegex() + ").*[(][?]:.*[)].*")
42+
select r,
43+
"This regular expression has an unescaped '.' before '" + hostPart +
44+
"', so it might match more hosts than expected."
Lines changed: 85 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,85 @@
1+
<!DOCTYPE qhelp PUBLIC
2+
"-//Semmle//qhelp//EN"
3+
"qhelp.dtd">
4+
<qhelp>
5+
6+
<overview>
7+
<p>
8+
9+
Sanitizing untrusted URLs is an important technique for
10+
preventing attacks such as request forgeries and malicious
11+
redirections. Usually, this is done by checking that the host of a URL
12+
is in a set of allowed hosts.
13+
14+
</p>
15+
16+
<p>
17+
However, treating the URL as a string and checking if one of the
18+
allowed hosts is a substring of the URL is very prone to errors.
19+
Malicious URLs can bypass such security checks by embedding one
20+
of the allowed hosts in an unexpected location.
21+
22+
</p>
23+
24+
<p>
25+
26+
Even if the substring check is not used in a
27+
security-critical context, the incomplete check may still cause
28+
undesirable behaviors when the check succeeds accidentally.
29+
30+
</p>
31+
</overview>
32+
33+
<recommendation>
34+
<p>
35+
36+
Parse a URL before performing a check on its host value,
37+
and ensure that the check handles arbitrary subdomain sequences
38+
correctly.
39+
40+
</p>
41+
</recommendation>
42+
43+
<example>
44+
45+
<p>
46+
47+
The following example code checks that a URL redirection
48+
will reach the <code>example.com</code> domain.
49+
50+
</p>
51+
52+
<sample src="examples/IncompleteUrlSubstringSanitization.py"/>
53+
54+
<p>
55+
56+
The first two examples show unsafe checks that are easily bypassed.
57+
In <code>unsafe1</code> the attacker can simply add
58+
<code>example.com</code> anywhere in the url. For example,
59+
<code>http://evil-example.net/example.com</code>.
60+
</p>
61+
<p>
62+
In <code>unsafe2</code> the attacker must use a hostname ending in
63+
<code>example.com</code>, but that is easy to do. For example,
64+
<code>http://benign-looking-prefix-example.com</code>.
65+
66+
</p>
67+
68+
<p>
69+
70+
The second two examples show safe checks.
71+
In <code>safe1</code>, a white-list is used. Although fairly inflexible,
72+
this is easy to get right and is most likely to be safe.
73+
</p>
74+
<p>
75+
In <code>safe2</code>, <code>urlparse</code> is used to parse the URL,
76+
then the hostname is checked to make sure it ends with <code>.example.com</code>.
77+
</p>
78+
79+
</example>
80+
81+
<references>
82+
<li>OWASP: <a href="https://www.owasp.org/index.php/Server_Side_Request_Forgery">SSRF</a></li>
83+
<li>OWASP: <a href="https://www.owasp.org/index.php/Unvalidated_Redirects_and_Forwards_Cheat_Sheet">XSS Unvalidated Redirects and Forwards Cheat Sheet</a>.</li>
84+
</references>
85+
</qhelp>
Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,58 @@
1+
/**
2+
* @name Incomplete URL substring sanitization
3+
* @description Security checks on the substrings of an unparsed URL are often vulnerable to bypassing.
4+
* @kind problem
5+
* @problem.severity warning
6+
* @precision high
7+
* @id py/incomplete-url-substring-sanitization
8+
* @tags correctness
9+
* security
10+
* external/cwe/cwe-20
11+
*/
12+
13+
14+
import python
15+
import semmle.python.regex
16+
17+
private string commonTopLevelDomainRegex() {
18+
result = "com|org|edu|gov|uk|net|io"
19+
}
20+
21+
predicate looksLikeUrl(StrConst s) {
22+
exists(string text |
23+
text = s.getText()
24+
|
25+
text.regexpMatch("(?i)([a-z]*:?//)?\\.?([a-z0-9-]+\\.)+(" +
26+
commonTopLevelDomainRegex() +")(:[0-9]+)?/?")
27+
or
28+
// target is a HTTP URL to a domain on any TLD
29+
text.regexpMatch("(?i)https?://([a-z0-9-]+\\.)+([a-z]+)(:[0-9]+)?/?")
30+
)
31+
}
32+
33+
predicate incomplete_sanitization(Expr sanitizer, StrConst url) {
34+
looksLikeUrl(url) and
35+
(
36+
sanitizer.(Compare).compares(url, any(In i), _)
37+
or
38+
call_to_startswith(sanitizer, url)
39+
or
40+
unsafe_call_to_endswith(sanitizer, url)
41+
)
42+
}
43+
44+
predicate call_to_startswith(Call sanitizer, StrConst url) {
45+
sanitizer.getFunc().(Attribute).getName() = "startswith"
46+
and
47+
sanitizer.getArg(0) = url
48+
}
49+
50+
predicate unsafe_call_to_endswith(Call sanitizer, StrConst url) {
51+
sanitizer.getFunc().(Attribute).getName() = "endswith" and
52+
sanitizer.getArg(0) = url and
53+
not url.getText().regexpMatch("(?i)\\.([a-z0-9-]+)(\\.[a-z0-9-]+)+")
54+
}
55+
56+
from Expr sanitizer, StrConst url
57+
where incomplete_sanitization(sanitizer, url)
58+
select sanitizer, "'$@' may be at an arbitrary position in the sanitized URL.", url, url.getText()
Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
from flask import Flask, request, redirect
2+
import re
3+
4+
app = Flask(__name__)
5+
6+
UNSAFE_REGEX = re.compile("(www|beta).example.com/")
7+
SAFE_REGEX = re.compile(r"(www|beta)\.example\.com/")
8+
9+
@app.route('/some/path/bad')
10+
def unsafe(request):
11+
target = request.args.get('target', '')
12+
if UNSAFE_REGEX.match(target):
13+
return redirect(target)
14+
15+
@app.route('/some/path/good')
16+
def safe(request):
17+
target = request.args.get('target', '')
18+
if SAFE_REGEX.match(target):
19+
return redirect(target)
Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
from flask import Flask, request, redirect
2+
from urllib.parse import urlparse
3+
4+
app = Flask(__name__)
5+
6+
# Not safe, as "evil-example.net/example.com" would be accepted
7+
8+
@app.route('/some/path/bad1')
9+
def unsafe1(request):
10+
target = request.args.get('target', '')
11+
if "example.com" in target:
12+
return redirect(target)
13+
14+
# Not safe, as "benign-looking-prefix-example.com" would be accepted
15+
16+
@app.route('/some/path/bad2')
17+
def unsafe2(request):
18+
target = request.args.get('target', '')
19+
if target.endswith("example.com"):
20+
return redirect(target)
21+
22+
23+
24+
#Simplest and safest approach is to use a white-list
25+
26+
@app.route('/some/path/good1')
27+
def safe1(request):
28+
whitelist = [
29+
"example.com/home",
30+
"example.com/login",
31+
]
32+
target = request.args.get('target', '')
33+
if target in whitelist:
34+
return redirect(target)
35+
36+
#More complex example allowing sub-domains.
37+
38+
@app.route('/some/path/good2')
39+
def safe2(request):
40+
target = request.args.get('target', '')
41+
host = urlparse(target).hostname
42+
#Note the '.' preceding example.com
43+
if host and host.endswith(".example.com"):
44+
return redirect(target)
45+
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
| hosttest.py:6:27:6:51 | Str | This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected. |
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
Security/CWE-020/IncompleteHostnameRegExp.ql

0 commit comments

Comments
 (0)