Shared: Improve sensitive data heuristics#20024
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR improves sensitive data heuristics across multiple programming languages by enhancing the regular expressions used to identify sensitive data patterns. The changes add new matching patterns for various types of sensitive information while also improving exclusions to reduce false positives.
- Expands detection patterns for authentication data (oauth, api keys, mfa), financial information (cvv, iban, routing numbers), health data (gender, patient records), and device information (IP addresses, MAC addresses, fingerprints)
- Adds "confidential" to secret detection patterns and improves the trusted pattern exclusion
- Enhances exclusion patterns to reduce false positives for file paths, URLs, and accounting-related terms
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| swift/ql/lib/codeql/swift/security/internal/SensitiveDataHeuristics.qll | Updates sensitive data detection patterns for Swift |
| rust/ql/lib/codeql/rust/security/internal/SensitiveDataHeuristics.qll | Updates sensitive data detection patterns for Rust |
| ruby/ql/lib/codeql/ruby/security/internal/SensitiveDataHeuristics.qll | Updates sensitive data detection patterns for Ruby |
| python/ql/lib/semmle/python/security/internal/SensitiveDataHeuristics.qll | Updates sensitive data detection patterns for Python |
| javascript/ql/lib/semmle/javascript/security/internal/SensitiveDataHeuristics.qll | Updates sensitive data detection patterns for JavaScript |
| rust/ql/test/library-tests/sensitivedata/test.rs | Updates test expectations to reflect improved heuristics |
|
Based on DCA results across all affected languages I think I'm going to:
|
|
I have no strong opinions on the changes to the heuristics - if the changes are tested and seem to improve things, then I'm all 👍 .
|
#19984 does slightly more, but I'm not sure how to judge the DCA results (which is why it's in draft; happy to open it up for review). |
I'll take a look...
I would love to, I'll raise the idea in Q2 planning. |
I think the main motivation there was in the context of "sensitive data stored unencrypted / in plaintext" queries but feel free to enable it if you're happy with the results |
|
Ah that's helpful context - I think I will try enabling the e-mail heuristic in a separate follow-up PR, but pay very close attention to results on the sensitive data storage queries. It won't be too much effort lost if it doesn't work out. |
|
Merged in latest |
aibaars
left a comment
There was a problem hiding this comment.
Changes look good to me. A bit curious about _iter suffix though.
| */ | ||
| string maybeSecret() { result = "(?is).*((?<!is|is_)secret|(?<!un|un_|is|is_)trusted).*" } | ||
| string maybeSecret() { | ||
| result = "(?is).*((?<!is|is_)secret|(?<!un|un_|is|is_)trusted(?!_iter)|confidential).*" |
There was a problem hiding this comment.
What is the reason for adding (?!_iter) ?
There was a problem hiding this comment.
In Rust there are library methods called from_trusted_iterator, where "trusted" means it's trusted not to overrun the container (or something along those lines), not anything to do with secret or confidential data. There's actually quite a lot of flow from those calls to sinks, i.e. it's causing a good number of false positive results, not to mention wobble in nightly DCA runs. So the exclusion is a bit specific, but I'm very keen to add it.
There is a test case sink(MyArray::from_trusted_iterator(iter)); that resembles (simplified) what we see in the wild.
There was a problem hiding this comment.
Ah makes sense, thanks for clarifying.
Improve sensitive data heuristics, by adding new matching heuristics and new exclusions. This is very, very loosely based on an initial version by Copilot, though I had to cut or refine the majority of those suggestions to get good results (testing with Rust).
I will start DCA runs on all affected languages. I do anticipate some further refinements may be required as common patterns vary slightly between languages.
There are three things I'd like opinions about:
e(mail|_mail)heuristic? Its labelled with a comment "seems too noisy", but I don't understand what motivated that - it always worked very well in my tests (Swift + Rust).uidheuristic (([uU]|^|_|[a-z](?=U))([uU][iI][dD]))? I believe its responsible for a lot of false positive results because UID might stand for "User ID" (likely sensitive) or just "Unique ID" (which may or may not be sensitive, often they aren't). What are your experiences? (@andersfugmann FYI as we discussed this recently)