Skip to content

Commit 88d8cb5

Browse files
committed
Python: Two new queries for URL and hostname sanitization (CWE-020).
1 parent ffa8b12 commit 88d8cb5

13 files changed

+385
-0
lines changed

change-notes/1.20/analysis-python.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,8 @@ Removes false positives seen when using Python 3.6, but not when using earlier v
1515
|-----------------------------|-----------|--------------------------------------------------------------------|
1616
| Default version of SSL/TLS may be insecure (`py/insecure-default-protocol`) | security, external/cwe/cwe-327 | Finds instances where an insecure default protocol may be used. Results are shown on LGTM by default. |
1717
| Use of insecure SSL/TLS version (`py/insecure-protocol`) | security, external/cwe/cwe-327 | Finds instances where a known insecure protocol has been specified. Results are shown on LGTM by default. |
18+
| Incomplete regular expression for hostnames (`py/incomplete-hostname-regexp`) | security, external/cwe/cwe-020 | Finds instances where a hostname is incompletely sanitized due to enescaped character in a regular expression. Results are shown on LGTM by default. |
19+
| Incomplete URL substring sanitization (`py/incomplete-url-substring-sanitization`) | security, external/cwe/cwe-020 | Finds instances where a URL is incompletely sanitized due to insufficient checks. Results are shown on LGTM by default. |
1820

1921
## Changes to existing queries
2022

Lines changed: 70 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,70 @@
1+
<!DOCTYPE qhelp PUBLIC
2+
"-//Semmle//qhelp//EN"
3+
"qhelp.dtd">
4+
<qhelp>
5+
6+
<overview>
7+
<p>
8+
9+
Sanitizing untrusted URLs is an important technique for
10+
preventing attacks such as request forgeries and malicious
11+
redirections. Often, this is done by checking that the host of a URL
12+
is in a set of allowed hosts.
13+
14+
</p>
15+
16+
<p>
17+
18+
If a regular expression implements such a check, it is
19+
easy to accidentally make the check too permissive by not escaping the
20+
<code>.</code> meta-characters appropriately.
21+
22+
Even if the check is not used in a security-critical
23+
context, the incomplete check may still cause undesirable behaviors
24+
when it accidentally succeeds.
25+
26+
</p>
27+
</overview>
28+
29+
<recommendation>
30+
<p>
31+
32+
Escape all meta-characters appropriately when constructing
33+
regular expressions for security checks, pay special attention to the
34+
<code>.</code> meta-character.
35+
36+
</p>
37+
</recommendation>
38+
39+
<example>
40+
41+
<p>
42+
43+
The following example code checks that a URL redirection
44+
will reach the <code>example.com</code> domain, or one of its
45+
subdomains.
46+
47+
</p>
48+
49+
<sample src="examples/IncompleteHostnameRegExp.py"/>
50+
51+
<p>
52+
The <code>unsafe</code> check is easy to bypass because the unescaped
53+
<code>.</code> allows for any character before
54+
<code>example.com</code>, effectively allowing the redirect to go to
55+
an attacker-controlled domain such as <code>wwwXexample.com</code>.
56+
57+
</p>
58+
<p>
59+
This vulnerability is addressed in the <code>safe</code> check, which
60+
escapes the <code>.</code> and will reject <code>wwwXexample.com</code>.
61+
62+
</p>
63+
64+
</example>
65+
66+
<references>
67+
<li>OWASP: <a href="https://www.owasp.org/index.php/Server_Side_Request_Forgery">SSRF</a></li>
68+
<li>OWASP: <a href="https://www.owasp.org/index.php/Unvalidated_Redirects_and_Forwards_Cheat_Sheet">XSS Unvalidated Redirects and Forwards Cheat Sheet</a>.</li>
69+
</references>
70+
</qhelp>
Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
/**
2+
* @name Incomplete regular expression for hostnames
3+
* @description Matching a URL or hostname against a regular expression that contains an unescaped dot as part of the hostname might match more hostnames than expected.
4+
* @kind problem
5+
* @problem.severity warning
6+
* @precision high
7+
* @id py/incomplete-hostname-regexp
8+
* @tags correctness
9+
* security
10+
* external/cwe/cwe-20
11+
*/
12+
13+
import python
14+
import semmle.python.regex
15+
16+
private string commonTopLevelDomainRegex() {
17+
result = "com|org|edu|gov|uk|net|io"
18+
}
19+
20+
/**
21+
* Holds if `pattern` is a regular expression pattern for URLs with a host matched by `hostPart`,
22+
* and `pattern` contains a subtle mistake that allows it to match unexpected hosts.
23+
*/
24+
bindingset[pattern]
25+
predicate isIncompleteHostNameRegExpPattern(string pattern, string hostPart) {
26+
hostPart = pattern
27+
.regexpCapture("(?i).*" +
28+
// an unescaped single `.`
29+
"(?<!\\\\)[.]" +
30+
// immediately followed by a sequence of subdomains, perhaps with some regex characters mixed in, followed by a known TLD
31+
"([():|?a-z0-9-]+(\\\\)?[.](" + commonTopLevelDomainRegex() + "))" + ".*", 1)
32+
}
33+
34+
from Regex r, string pattern, string hostPart
35+
where
36+
(
37+
r.getText() = pattern
38+
) and
39+
isIncompleteHostNameRegExpPattern(pattern, hostPart) and
40+
// ignore patterns with capture groups after the TLD
41+
not pattern.regexpMatch("(?i).*[.](" + commonTopLevelDomainRegex() + ").*[(][?]:.*[)].*")
42+
select r,
43+
"This regular expression has an unescaped '.' before '" + hostPart +
44+
"', so it might match more hosts than expected."
Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,86 @@
1+
<!DOCTYPE qhelp PUBLIC
2+
"-//Semmle//qhelp//EN"
3+
"qhelp.dtd">
4+
<qhelp>
5+
6+
<overview>
7+
<p>
8+
9+
Sanitizing untrusted URLs is an important technique for
10+
preventing attacks such as request forgeries and malicious
11+
redirections. Usually, this is done by checking that the host of a URL
12+
is in a set of allowed hosts.
13+
14+
</p>
15+
16+
<p>
17+
18+
However, it is notoriously error-prone to treat the URL as
19+
a string and check if one of the allowed hosts is a substring of the
20+
URL. Malicious URLs can bypass such security checks by embedding one
21+
of the allowed hosts in an unexpected location.
22+
23+
</p>
24+
25+
<p>
26+
27+
Even if the substring check is not used in a
28+
security-critical context, the incomplete check may still cause
29+
undesirable behaviors when the check succeeds accidentally.
30+
31+
</p>
32+
</overview>
33+
34+
<recommendation>
35+
<p>
36+
37+
Parse a URL before performing a check on its host value,
38+
and ensure that the check handles arbitrary subdomain sequences
39+
correctly.
40+
41+
</p>
42+
</recommendation>
43+
44+
<example>
45+
46+
<p>
47+
48+
The following example code checks that a URL redirection
49+
will reach the <code>example.com</code> domain.
50+
51+
</p>
52+
53+
<sample src="examples/IncompleteUrlSubstringSanitization.py"/>
54+
55+
<p>
56+
57+
The first two examples show unsafe checks that are easily bypassed.
58+
In <code>unsafe1</code> the attacker can simply add
59+
<code>example.com</code> anywhere in the url. For example,
60+
<code>http://evil-example.net/example.com</code>.
61+
</p>
62+
<p>
63+
In <code>unsafe2</code> the attacker must use a hostname ending in
64+
<code>example.com</code>, but that is easy to do. For example,
65+
<code>http://benign-looking-prefix-example.com</code>.
66+
67+
</p>
68+
69+
<p>
70+
71+
The second two examples show safe checks.
72+
In <code>safe1</code>, a white-list is used. Although fairly inflexible,
73+
this is easy to get right and is most likely to be safe.
74+
</p>
75+
<p>
76+
In <code>safe2</code>, <code>urlparse</code> is used to parse the URL,
77+
then the hostname is checked to make sure it ends with <code>.example.com</code>.
78+
</p>
79+
80+
</example>
81+
82+
<references>
83+
<li>OWASP: <a href="https://www.owasp.org/index.php/Server_Side_Request_Forgery">SSRF</a></li>
84+
<li>OWASP: <a href="https://www.owasp.org/index.php/Unvalidated_Redirects_and_Forwards_Cheat_Sheet">XSS Unvalidated Redirects and Forwards Cheat Sheet</a>.</li>
85+
</references>
86+
</qhelp>
Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,58 @@
1+
/**
2+
* @name Incomplete URL substring sanitization
3+
* @description Security checks on the substrings of an unparsed URL are often vulnerable to bypassing.
4+
* @kind problem
5+
* @problem.severity warning
6+
* @precision high
7+
* @id py/incomplete-url-substring-sanitization
8+
* @tags correctness
9+
* security
10+
* external/cwe/cwe-20
11+
*/
12+
13+
14+
import python
15+
import semmle.python.regex
16+
17+
private string commonTopLevelDomainRegex() {
18+
result = "com|org|edu|gov|uk|net|io"
19+
}
20+
21+
predicate looksLikeUrl(StrConst s) {
22+
exists(string text |
23+
text = s.getText()
24+
|
25+
text.regexpMatch("(?i)([a-z]*:?//)?\\.?([a-z0-9-]+\\.)+(" +
26+
commonTopLevelDomainRegex() +")(:[0-9]+)?/?")
27+
or
28+
// target is a HTTP URL to a domain on any TLD
29+
text.regexpMatch("(?i)https?://([a-z0-9-]+\\.)+([a-z]+)(:[0-9]+)?/?")
30+
)
31+
}
32+
33+
predicate incomplete_sanitization(Expr sanitizer, StrConst url) {
34+
looksLikeUrl(url) and
35+
(
36+
sanitizer.(Compare).compares(url, any(In i), _)
37+
or
38+
call_to_startswith(sanitizer, url)
39+
or
40+
unsafe_call_to_endswith(sanitizer, url)
41+
)
42+
}
43+
44+
predicate call_to_startswith(Call sanitizer, StrConst url) {
45+
sanitizer.getFunc().(Attribute).getName() = "startswith"
46+
and
47+
sanitizer.getArg(0) = url
48+
}
49+
50+
predicate unsafe_call_to_endswith(Call sanitizer, StrConst url) {
51+
sanitizer.getFunc().(Attribute).getName() = "endswith" and
52+
sanitizer.getArg(0) = url and
53+
not url.getText().regexpMatch("(?i)\\.([a-z0-9-]+)(\\.[a-z0-9-]+)+")
54+
}
55+
56+
from Expr sanitizer, StrConst url
57+
where incomplete_sanitization(sanitizer, url)
58+
select sanitizer, "'$@' may be at an arbitrary position in the sanitized URL.", url, url.getText()
Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
from flask import Flask, request, redirect
2+
import re
3+
4+
app = Flask(__name__)
5+
6+
UNSAFE_REGEX = re.compile("(www|beta).example.com/")
7+
SAFE_REGEX = re.compile(r"(www|beta)\.example\.com/")
8+
9+
@app.route('/some/path/bad')
10+
def unsafe(request):
11+
target = request.args.get('target', '')
12+
if UNSAFE_REGEX.match(target):
13+
return redirect(target)
14+
15+
@app.route('/some/path/good')
16+
def safe(request):
17+
target = request.args.get('target', '')
18+
if SAFE_REGEX.match(target):
19+
return redirect(target)
Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
from flask import Flask, request, redirect
2+
from urllib.parse import urlparse
3+
4+
app = Flask(__name__)
5+
6+
@app.route('/some/path/bad1')
7+
def unsafe1(request):
8+
target = request.args.get('target', '')
9+
if "example.com" in target:
10+
return redirect(target)
11+
12+
@app.route('/some/path/bad2')
13+
def unsafe2(request):
14+
target = request.args.get('target', '')
15+
if target.endswith("example.com"):
16+
return redirect(target)
17+
18+
19+
20+
#Simplest and safest approach is to use a white-list
21+
22+
@app.route('/some/path/good1')
23+
def safe1(request):
24+
whitelist = [
25+
"example.com/home",
26+
"example.com/login",
27+
]
28+
target = request.args.get('target', '')
29+
if target in whitelist:
30+
return redirect(target)
31+
32+
#More complex example allowing sub-domains.
33+
34+
@app.route('/some/path/good2')
35+
def safe2(request):
36+
target = request.args.get('target', '')
37+
host = urlparse(target).hostname
38+
#Note the '.' preceding example.com
39+
if host and host.endswith(".example.com"):
40+
return redirect(target)
41+
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
| hosttest.py:6:27:6:51 | Str | This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected. |
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
Security/CWE-020/IncompleteHostnameRegExp.ql
Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
| urltest.py:9:8:9:30 | Compare | '$@' may be at an arbitrary position in the sanitized URL. | urltest.py:9:8:9:20 | Str | example.com |
2+
| urltest.py:15:8:15:37 | Attribute() | '$@' may be at an arbitrary position in the sanitized URL. | urltest.py:15:24:15:36 | Str | example.com |

0 commit comments

Comments
 (0)