Taint analysis
Semgrep supports taint analysis (or taint tracking) through taint rules, which you specify by adding mode: taint to your rule. Taint analysis is a data-flow analysis that tracks the flow of untrusted (tainted) data throughout the body of a function or method. Tainted data originates from tainted sources. If tainted data is not transformed or checked accordingly (sanitized), taint analysis reports a finding whenever the data reaches a vulnerable function, called a sink. Tainted data flows from sources to sinks through propagators, such as assignments or function calls.
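To make the vocabulary concrete, here is a toy Python model of the idea for straight-line code. It is only a sketch and not how Semgrep is implemented: the statement encoding, the function find_tainted_sinks, and the names in SOURCES, SANITIZERS, and SINKS are all assumptions made for illustration.

```python
# Toy model of taint tracking over straight-line code.
# NOT Semgrep's implementation; all names here are hypothetical.

SOURCES = {"get_user_input"}
SANITIZERS = {"sanitize_input"}
SINKS = {"html_output"}


def find_tainted_sinks(statements):
    """Return the indices of statements where tainted data reaches a sink.

    Each statement is a tuple (target, func, arg) standing for
    `target = func(arg)`; target and arg may be None.
    """
    tainted = set()   # names of variables that currently hold tainted data
    findings = []
    for index, (target, func, arg) in enumerate(statements):
        arg_tainted = arg in tainted
        if func in SINKS and arg_tainted:
            findings.append(index)
        if func in SOURCES:
            result_tainted = True          # sources introduce taint
        elif func in SANITIZERS:
            result_tainted = False         # sanitizers remove taint
        else:
            result_tainted = arg_tainted   # other calls propagate taint
        if target is not None:
            if result_tainted:
                tainted.add(target)
            else:
                tainted.discard(target)
    return findings


program = [
    ("data", "get_user_input", None),      # data = get_user_input()
    ("clean", "sanitize_input", "data"),   # clean = sanitize_input(data)
    (None, "html_output", "clean"),        # html_output(clean)
    (None, "html_output", "data"),         # html_output(data)
]
findings = find_tainted_sinks(program)
print(findings)
```

Only the last statement is flagged in this model, because data is still tainted there while clean went through a sanitizer.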
Getting started
Taint tracking rules must specify mode: taint, which enables the following operators:
- pattern-sources (required)
- pattern-propagators (optional)
- pattern-sanitizers (optional)
- pattern-sinks (required)
These operators (which act as pattern-either operators) take a list of patterns that specify what is considered a source, a propagator, a sanitizer, or a sink. Note that you can use any pattern operator and you have the same expressive power as in a mode: search rule.
For example:
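A minimal sketch of such a rule and its target code is shown below. It is a reconstruction from the walkthrough in this section, not the official example: the rule id, message, severity, and the bodies of the stub functions are assumptions added so that the snippet runs on its own.

```python
import html

# Hypothetical taint rule, kept as a string for reference. Only the four
# patterns come from the text; id, message, and severity are assumptions.
RULE = """
rules:
  - id: taint-example
    mode: taint
    languages: [python]
    severity: ERROR
    message: Tainted data reaches a sink
    pattern-sources:
      - pattern: get_user_input(...)
    pattern-sanitizers:
      - pattern: sanitize_input(...)
    pattern-sinks:
      - pattern: html_output(...)
      - pattern: eval(...)
"""


# Stub implementations (assumptions) so that the target code executes.
def get_user_input():
    return "<b>hi</b>"


def sanitize_input(value):
    return html.escape(value)


def html_output(value):
    return f"<p>{value}</p>"


def route1():
    data = get_user_input()
    data = sanitize_input(data)
    return html_output(data)  # not reported: data was sanitized first


def route2():
    data = get_user_input()
    return html_output(data)  # reported: data is still tainted
```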
Here Semgrep tracks the data returned by get_user_input(), which is the source of taint. Think of Semgrep running the pattern get_user_input(...) on your code, finding all places where get_user_input gets called, and labeling them as tainted. That is exactly what is happening under the hood!
The rule specifies the sanitizer sanitize_input(...), so any expression that matches that pattern is considered sanitized. In particular, the expression sanitize_input(data) is labeled as sanitized. Even if data is tainted, it does not produce any findings there, because it occurs inside a piece of sanitized code.
Finally, the rule specifies that anything matching either html_output(...) or eval(...) should be regarded as a sink. There are two calls to html_output(data), and both are labeled as sinks. The first one, in route1, is not reported because data is sanitized before reaching the sink, whereas the second one, in route2, is reported because the data that reaches the sink is still tainted.
You can find more examples of taint rules in the Semgrep Registry, for instance: express-sandbox-code-injection.
Metavariables used in pattern-sources are considered different from those used in pattern-sinks, even if they have the same name! See Metavariables, rule message, and unification for further details.
Sources
A taint source is specified by a pattern. Like in a search-mode rule, you can start this pattern with one of the following keys: pattern, patterns, pattern-either, pattern-regex. Note that any subexpression that is matched by this pattern will be regarded as a source of taint.
In addition, taint sources accept the following options:
| Option | Type | Default | Description |
|---|---|---|---|
| exact | true | false | See Exact sources. |
| by-side-effect | only | false | See Sources by side-effect. |
| control (Pro) 🧪 | true | false | See Control sources. |
Example:
pattern-sources:
- pattern: source(...)
Exact sources
Given the source specification below, and a piece of code such as source(sink(x)), the call sink(x) is reported as a tainted sink.
pattern-sources:
- pattern: source(...)
The reason is that the pattern source(...) matches all of source(sink(x)), and that makes Semgrep consider every subexpression in that piece of code as being a source. In particular, x is a source, and it is being passed into sink!
This is the default for historical reasons, but it may change in the future.
It is possible to instruct Semgrep to only consider as taint sources the "exact" matches of a source pattern by setting exact: true:
pattern-sources:
- pattern: source(...)
  exact: true
Once the source is "exact," Semgrep will no longer consider subexpressions as taint sources, and sink(x) inside source(sink(x)) will not be reported as a tainted sink (unless x is tainted in some other way).
For many rules this distinction is not very meaningful because it does not always make sense that a sink occurs inside the arguments of a source function.
If one of your rules relies on non-exact matching of sources, we advise you to make it explicit with exact: false, even though it is the current default, so that your rule does not break if the default changes.
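The difference between the two settings can be traced on a small piece of target code. This is a sketch only: the bodies of source and sink are hypothetical stubs, and the comments restate the behavior described in this section for a rule whose source pattern is source(...) and whose sink pattern is assumed to be sink(...).

```python
# Hypothetical stubs so that the snippet runs on its own.
def source(value):
    return f"tainted({value})"


def sink(value):
    return value


x = "plain"

# exact: false (current default): source(...) matches all of source(sink(x)),
# so every subexpression, including x, counts as a source and sink(x) is
# reported.
# exact: true: only the call source(...) itself is a source, so sink(x) is
# not reported (unless x is tainted in some other way).
result = source(sink(x))

# With either setting, the value returned by source(...) is tainted, so this
# call is a tainted sink.
final = sink(result)
```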
Sources by side-effect
Consider the following hypothetical Python code, where make_tainted is a function that makes its argument tainted by side-effect:
make_tainted(my_set)
sink(my_set)
This kind of source can be specified by setting by-side-effect: true:
pattern-sources:
- patterns:
  - pattern: make_tainted($X)
  - focus-metavariable: $X
  by-side-effect: true
When this option is enabled, and the source specification matches a variable (or in general, an l-value) exactly, then Semgrep assumes that the variable (or l-value) becomes tainted by side-effect at the precise places where the source specification produces a match.
The matched occurrences themselves are considered tainted; that is, the occurrence of my_set in make_tainted(my_set) is itself tainted too. If you do not want this to be the case, then set by-side-effect: only instead.
You must use focus-metavariable: $X to focus the match on the l-value that you want to taint, otherwise by-side-effect does not work.
If the source does not set by-side-effect, then only the occurrence of my_set in make_tainted(my_set) is tainted, but not the occurrence of my_set in sink(my_set). The source specification matches only the first occurrence and, without by-side-effect: true, Semgrep does not know that make_tainted is updating the variable my_set by side-effect. Thus, a taint rule using such a specification does not produce any finding.
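A runnable version of the hypothetical code makes the side effect visible. The bodies of make_tainted and sink are assumptions for illustration; the comments summarize the behavior described above and assume a rule whose sink pattern is sink(...).

```python
def make_tainted(collection):
    """Hypothetical helper: mutates its argument in place and returns None."""
    collection.add("untrusted")


def sink(collection):
    """Hypothetical sink: consumes the collection."""
    return sorted(collection)


my_set = {"trusted"}

# With by-side-effect: true (or only), Semgrep treats my_set as tainted from
# this point on, because the call taints its argument by side-effect.
make_tainted(my_set)

# With by-side-effect: this call is reported.
# Without it: no finding, since only the occurrence of my_set inside
# make_tainted(my_set) was marked as a source.
received = sink(my_set)
```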
You could be tempted to write a source specification like the following one (this was the official workaround before by-side-effect existed):
pattern-sources:
- patterns:
  - pattern-inside: |
      make_tainted($X)
      ...
  - pattern: $X
This tells Semgrep that every occurrence of $X after make_tainted($X) must be considered a source.
This approach has two main limitations. First, it overrides any sanitization that can be performed on the code matched by $X. In the example code below, the call sink(x) is reported as tainted despite x having been sanitized!
make_tainted(x)
x = sanitize(x)
sink(x) # false positive
Second, the ellipsis operator (...) has limitations of its own. For example, Semgrep does not report any finding in the code below when such a source specification is in use:
if cond:
    make_tainted(x)
sink(x) # false negative
The by-side-effect option was added precisely to address those limitations. However, that kind of workaround can still be useful in other situations!
Function arguments as sources
To specify that an argument of a function must be considered a taint source, simply write a pattern that matches that argument:
pattern-sources:
- patterns:
  - pattern-inside: |
      def foo($X, ...):
        ...
  - focus-metavariable: $X
Note that the use of focus-metavariable: $X is very important, and using pattern: $X is not equivalent. With focus-metavariable: $X, Semgrep matches the formal parameter exactly.
By contrast, a taint rule that uses pattern: $X in place of focus-metavariable: $X does not match the formal parameter itself; it matches all of the parameter's uses inside the function definition. Even if x is sanitized via x = sanitize(x), the occurrence of x inside sink(x) is itself a taint source (due to pattern: $X), and so sink(x) is reported as tainted.
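The target code below sketches the situation just described. The function bodies are hypothetical stubs, and the comments assume a rule with sink(...) as its sink pattern and sanitize(...) as its sanitizer pattern; neither pattern is shown in this section.

```python
# Hypothetical stubs so that the snippet runs on its own.
def sanitize(value):
    return value.replace("<", "&lt;")


def sink(value):
    return value


def foo(x, other):
    # focus-metavariable: $X -> only the formal parameter x in the signature
    # above is the taint source.
    x = sanitize(x)
    # focus-metavariable: $X -> no finding here, because x was sanitized.
    # pattern: $X -> finding here, because this very use of x is itself
    # matched as a source, regardless of the sanitization above.
    return sink(x)
```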