|
| 1 | +Summary-based information flow analysis |
| 2 | +======================================= |
| 3 | + |
| 4 | +Overview |
| 5 | +-------- |
| 6 | + |
| 7 | +This document presents an approach for running information flow analyses (such as the standard |
| 8 | +Semmle security queries) on an application that depends on one or more npm packages. Instead of |
| 9 | +installing the npm packages during the snapshot build and analyzing them together with application |
| 10 | +code, we analyze each package in isolation and compute *flow summaries* that record information |
| 11 | +about any sources, sinks and flow steps contributed by the package's API. These flow summaries |
| 12 | +are then imported when building a snapshot of the application (usually in the form of CSV files |
| 13 | +added as external data), and are picked up by the standard security queries, allowing them to reason |
| 14 | +about flow into, out of and through the npm packages as though they had been included as part of the |
| 15 | +build. |
| 16 | + |
| 17 | +Motivating example |
| 18 | +------------------ |
| 19 | + |
| 20 | +Let us take the `mkdirp <https://www.npmjs.com/package/mkdirp>`_ package as an example. It exports |
| 21 | +a function that takes as its first argument a file system path, and creates a folder with that |
| 22 | +path, as well as any parent folders that do not exist yet. As further arguments, the function |
| 23 | +accepts an optional configuration object and a callback to invoke once the folder has been |
| 24 | +created. |
| 25 | + |
| 26 | +An application might use this package as follows: |
| 27 | + |
| 28 | +.. code-block:: js |
| 29 | +
|
| 30 | + const mkdirp = require('mkdirp'); |
| 31 | + // ... |
| 32 | + mkdirp(p, opts, function cb(err) { |
| 33 | + // ... |
| 34 | + }); |
| 35 | +
|
| 36 | +If the value of ``p`` can be controlled by an untrusted user, this would allow them to create arbitrary |
| 37 | +folders, which may not be desirable. |
| 38 | + |
| 39 | +By analyzing the application code base together with the source code for the ``mkdirp`` package, |
| 40 | +Semmle's default path injection analysis would be able to track taint through the call to ``mkdirp`` into its |
| 41 | +implementation, which ultimately uses built-in Node.js file system APIs to create the folder. Since |
| 42 | +the path injection analysis has built-in models of these APIs it would then be able to spot and flag this |
| 43 | +vulnerability. |
| 44 | + |
| 45 | +However, analyzing ``mkdirp`` from scratch for every client application is wasteful. Moreover, it would |
| 46 | +in this case be undesirable to flag the location inside ``mkdirp`` where the folder is actually created |
| 47 | +as part of the alert: the developer of the client application did not write that code and hence will |
| 48 | +have a hard time understanding why it is being flagged. |
| 49 | + |
| 50 | +Both of these concerns can be addressed by treating the first argument to ``mkdirp`` as a path injection |
| 51 | +sink in its own right: the analysis no longer needs to track flow into the implementation of ``mkdirp``, |
| 52 | +so we would no longer need to include its source code in the analysis, and the alert would flag the call |
| 53 | +to ``mkdirp`` in application code, not its implementation in library code. |
| 54 | + |
| 55 | +The information that the first parameter of ``mkdirp`` is interpreted as a file system path and hence should |
| 56 | +be considered a path injection sink is an example of a *flow summary*, or more precisely a *sink summary*. |
| 57 | +Besides sink summaries, we also consider *source summaries* and *flow-step summaries*. |
| 58 | + |
| 59 | +In general, a sink summary states that some API interface point (such as a function parameter) should |
| 60 | +be considered a sink for a certain analysis, so if data from a known source reaches this point without |
| 61 | +undergoing appropriate sanitization, it should be flagged with an alert. A sink summary may also |
| 62 | +specify which taint kind the data needs to have in order for the sink to be problematic. |
| 63 | + |
| 64 | +Conversely, a source summary identifies some API interface point (such as the return value of a |
| 65 | +function) as a source of tainted data for a certain analysis, again optionally specifying a taint kind. |
| 66 | + |
| 67 | +Finally, a flow-step summary records the fact that data that flows into the package at some point |
| 68 | +may propagate to another point (for example, from a function parameter to its return value). |
| 69 | +In this case, there are two relevant taint kinds: one describing the taint of the data that |
| 70 | +enters, and one describing the taint of the data that emerges. In general, flow steps (like sources |
| 71 | +and sinks) are analysis-specific, since we need to know about sanitizers. |
| 72 | + |
| 73 | +In what follows we will first discuss how summaries are generated from a snapshot of an npm package, |
| 74 | +and then how they are imported when analyzing client code. Finally, we will discuss the format in which |
| 75 | +flow summaries are stored. |
| 76 | + |
| 77 | +Note that flow summaries are considered an experimental feature at this point. Using them involves |
| 78 | +some manual configuration, and we make no guarantee that the API will remain stable. |
| 79 | + |
| 80 | +Generating summaries |
| 81 | +-------------------- |
| 82 | + |
| 83 | +Flow summaries of an npm package can be generated by running special summary extraction queries |
| 84 | +either on a snapshot of the package itself, or on a snapshot of a hand-written model of the |
| 85 | +package. (Note that this requires a working installation of Semmle Core.) |
| 86 | + |
| 87 | +There are three default summary extraction queries: |
| 88 | + |
| 89 | +- Extract flow step summaries (``js/step-summary-extraction``, |
| 90 | + ``Security/Summaries/ExtractFlowStepSummaries.ql``) |
| 91 | +- Extract sink summaries (``js/sink-summary-extraction``, |
| 92 | + ``Security/Summaries/ExtractSinkSummaries.ql``) |
| 93 | +- Extract source summaries (``js/source-summary-extraction``, |
| 94 | + ``Security/Summaries/ExtractSourceSummaries.ql``) |
| 95 | + |
| 96 | +Using ``odasa runQuery``, you can run these queries individually against a snapshot of the npm |
| 97 | +package for which you want to create flow summaries, and store the output as CSV files named |
| 98 | +``additional-steps.csv``, ``additional-sinks.csv`` and ``additional-sources.csv``, respectively. |
| 99 | + |
| 100 | +For example, assuming that folder ``mkdirp-snapshot`` contains a snapshot of the ``mkdirp`` |
| 101 | +project, we can extract sink summaries using the command |
| 102 | + |
| 103 | +.. code-block:: bash |
| 104 | +
|
| 105 | + odasa runQuery \ |
| 106 | + --query $SEMMLE_DIST/queries/semmlecode-javascript-queries/Security/Summaries/ExtractSinkSummaries.ql \ |
| 107 | + --output-file additional-sinks.csv --snapshot mkdirp-snapshot |
| 108 | +
|
| 109 | +
|
| 110 | +Instead of generating summaries directly from the package source code, you can also generate |
| 111 | +them from a hand-written model of the package. The model should contain a ``package.json`` file |
| 112 | +giving the correct package name, and models for the relevant API entry points. The models are |
| 113 | +plain JavaScript with special comments annotating certain expressions as sources or sinks. |
| 114 | + |
| 115 | +For example, a model of ``mkdirp`` might look like this: |
| 116 | + |
| 117 | +.. code-block:: js |
| 118 | +
|
| 119 | + module.exports = function mkdirp(path) { |
| 120 | + path /* Semmle: sink: taint, TaintedPath */ |
| 121 | + }; |
| 122 | +
|
| 123 | +Annotation comments start with ``Semmle:``, and contain ``source`` and ``sink`` specifications. |
| 124 | +Each such specification lists a flow label (in this case, ``taint``) and a configuration to which |
| 125 | +the specification applies (in this case, ``TaintedPath``). |
| 126 | + |
| 127 | +A source specification annotates an expression as being a source of flow with the given label |
| 128 | +for the purposes of the given configuration, and similarly for sinks. Annotation comments apply to |
| 129 | +any expression (and more generally any data flow node) whose source location ends on the line |
| 130 | +where the comment starts. |
| 131 | + |
| 132 | +Using summaries |
| 133 | +--------------- |
| 134 | + |
| 135 | +Once you have created summaries using the approach outlined above, you have two options for |
| 136 | +including them in the analysis of a client application. |
| 137 | + |
| 138 | +External data |
| 139 | +::::::::::::: |
| 140 | + |
| 141 | +Firstly, you can include the CSV files generated by running the extraction queries as external |
| 142 | +data when building a snapshot of the client application by copying them into the |
| 143 | +``$snapshot/external/data`` folder. This is typically done by including a command like this |
| 144 | +in your ``project`` file: |
| 145 | + |
| 146 | +.. code-block:: xml |
| 147 | +
|
| 148 | + <build>cp /path/to/additional-sinks.csv ${snapshot}/external/data</build> |
| 149 | +
|
| 150 | +If you want to include summaries for multiple libraries, you have to concatenate the |
| 151 | +corresponding CSV files before copying them into the external data folder. |
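
The concatenation step can be scripted. The following Node.js sketch is one way to do it, under the assumptions that the CSV files have no header row and that no quoted field spans several lines; the function name ``concatSummaryFiles`` is made up for this example.

```javascript
const fs = require('fs');

// Append the rows of several summary CSV files into a single output file.
// Assumption: no header rows, one summary per line.
// Returns the number of rows written.
function concatSummaryFiles(inputFiles, outputFile) {
  const rows = [];
  for (const file of inputFiles) {
    const text = fs.readFileSync(file, 'utf8');
    for (const line of text.split(/\r?\n/)) {
      if (line.trim() !== '') rows.push(line); // skip blank lines
    }
  }
  fs.writeFileSync(outputFile, rows.length > 0 ? rows.join('\n') + '\n' : '');
  return rows.length;
}

// Possible use (paths are placeholders):
// concatSummaryFiles(['mkdirp/additional-sinks.csv', 'other/additional-sinks.csv'],
//                    'additional-sinks.csv');
```

The combined file can then be copied into the external data folder in the same way as a single summary file.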
| 152 | + |
| 153 | +Additionally, you need to import the library ``Security.Summaries.ImportFromCsv`` in your |
| 154 | +``javascript.qll``, which will pick up the summaries from external data and interpret them |
| 155 | +as additional sources, sinks and flow steps: |
| 156 | + |
| 157 | +.. code-block:: ql |
| 158 | +
|
| 159 | + import Security.Summaries.ImportFromCsv |
| 160 | +
|
| 161 | +After these preparatory steps, you can run your analysis without any further changes. |
| 162 | + |
| 163 | +External predicates |
| 164 | +::::::::::::::::::: |
| 165 | + |
| 166 | +The second method for including flow summaries is by including the |
| 167 | +``Security.Summaries.ImportFromExternalPredicates`` library in your analysis, which declares |
| 168 | +three external predicates ``additionalSteps``, ``additionalSinks`` and ``additionalSources`` that |
| 169 | +need to be instantiated with the flow summary CSV data. |
| 170 | + |
| 171 | +This is most easily done in QL for Eclipse, which will prompt you for CSV files to populate |
| 172 | +the three predicates. |
| 173 | + |
| 174 | +This approach has the advantage that you do not need to include the CSV files during the |
| 175 | +snapshot build, so you can use an existing snapshot, for example as downloaded from LGTM.com. |
| 176 | + |
| 177 | +Summary format |
| 178 | +-------------- |
| 179 | + |
| 180 | +Source and sink summaries are specified as tuples of the form ``(portal, kind, configuration)``, |
| 181 | +where ``portal`` is a description of the API element being marked as a source or sink, ``kind`` |
| 182 | +is a flow label (also known as "taint kind") describing the kind of information being generated |
| 183 | +or consumed, and ``configuration`` specifies which flow configuration the summary applies to. |
| 184 | + |
| 185 | +If ``kind`` is empty, it defaults to ``data`` for sources and either ``data`` or ``taint`` for sinks. |
| 186 | +If ``configuration`` is empty, the specification applies to all configurations. |
| 187 | +The default extraction queries never produce empty ``kind`` or ``configuration`` columns. |
| 188 | + |
| 189 | +Similarly, step summaries are tuples of the form |
| 190 | +``(inPortal, inKind, outPortal, outKind, configuration)``, stating that information with label |
| 191 | +``inKind`` that flows into ``inPortal`` resurfaces from ``outPortal``, now having kind ``outKind``. |
| 192 | +As before, ``configuration`` specifies which configuration this information applies to. |
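
To make the tuple shapes concrete, here is a small Node.js sketch that turns one CSV line into an object with named fields. It assumes that the CSV columns appear in the same order as the tuple components and that fields containing commas are wrapped in double quotes; neither assumption is confirmed by this document, and all function names are hypothetical.

```javascript
// Split one CSV line into fields, honouring double-quoted fields
// (inside quotes, "" stands for a literal quote character).
function splitCsvLine(line) {
  const fields = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const c = line[i];
    if (inQuotes) {
      if (c === '"' && line[i + 1] === '"') { field += '"'; i++; }
      else if (c === '"') inQuotes = false;
      else field += c;
    } else if (c === '"') inQuotes = true;
    else if (c === ',') { fields.push(field); field = ''; }
    else field += c;
  }
  fields.push(field);
  return fields;
}

// Assumed column order: portal, kind, configuration.
function parseSourceOrSinkRow(line) {
  const [portal, kind, configuration] = splitCsvLine(line);
  return { portal, kind, configuration };
}

// Assumed column order: inPortal, inKind, outPortal, outKind, configuration.
function parseStepRow(line) {
  const [inPortal, inKind, outPortal, outKind, configuration] = splitCsvLine(line);
  return { inPortal, inKind, outPortal, outKind, configuration };
}
```

For instance, ``parseSourceOrSinkRow('"(return (root u))",taint,TaintedPath')`` would yield an object whose ``kind`` is ``taint`` and whose ``configuration`` is ``TaintedPath``.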
| 193 | + |
| 194 | +In all of the above, ``portal`` is an S-expression that abstractly describes a *portal*, that is, |
| 195 | +an API interface point by which data may enter or leave the npm package being analyzed. |
| 196 | + |
| 197 | +Currently, we model five kinds of portals: |
| 198 | + |
| 199 | +- ``(root <uri>)``, representing the ``module`` object of the main module of the npm package |
| 200 | + described by ``<uri>``, which is a URL of the form ``https://www.npmjs.com/package/<pkg>``; |
| 201 | +- ``(member <base> <name>)``, representing property ``<name>`` of an object described by |
| 202 | + portal ``<base>``; |
| 203 | +- ``(instance <base>)``, representing an instance of a (constructor) function or class |
| 204 | + described by portal ``<base>``; |
| 205 | +- ``(parameter <base> <i>)``, representing the ``<i>``-th parameter (counting from zero) of a |
| 206 | + function described by portal ``<base>``; |
| 207 | +- ``(return <base>)``, representing the return value of a function described by portal ``<base>``. |
| 208 | + |
| 209 | +In our example above, the first parameter of the default export of package ``mkdirp`` is |
| 210 | +described by the portal |
| 211 | + |
| 212 | +.. code-block:: lisp |
| 213 | +
|
| 214 | + (parameter (member (root https://www.npmjs.com/package/mkdirp) default) 0) |
| 215 | +
|
| 216 | +As a more complicated example, |
| 217 | + |
| 218 | +.. code-block:: lisp |
| 219 | +
|
| 220 | + (parameter (parameter (member (instance (member (root https://www.npmjs.com/package/bluebird) Promise)) then) 1) 0) |
| 221 | +
|
| 222 | +describes the first parameter of a function passed as second argument to the ``then`` method of |
| 223 | +instances of the ``Promise`` constructor exported by package ``bluebird``. |
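
Because portals are plain S-expressions, they are easy to process with a few lines of code. The Node.js sketch below parses a portal into nested arrays and looks up the package URL of its ``root`` portal. It is not part of the Semmle libraries: the function names are hypothetical, and it assumes that atoms (names, indices and URLs) contain no whitespace or parentheses.

```javascript
// Parse a portal S-expression into nested arrays, for example
// "(return (root u))" becomes ["return", ["root", "u"]].
function parsePortal(text) {
  const tokens = text.match(/[()]|[^\s()]+/g) || [];
  let pos = 0;
  function parseNode() {
    const tok = tokens[pos++];
    if (tok === undefined) throw new SyntaxError('unexpected end of portal');
    if (tok === ')') throw new SyntaxError('unexpected ")"');
    if (tok !== '(') return tok; // atom: a name, an index or a URL
    const list = [];
    while (tokens[pos] !== ')') {
      if (pos >= tokens.length) throw new SyntaxError('missing ")"');
      list.push(parseNode());
    }
    pos++; // consume the closing ")"
    return list;
  }
  const tree = parseNode();
  if (pos !== tokens.length) throw new SyntaxError('trailing input after portal');
  return tree;
}

// Return the URL of the first (root <uri>) portal found in the tree, or null.
function rootUri(tree) {
  if (!Array.isArray(tree)) return null;
  if (tree[0] === 'root' && tree.length === 2) return tree[1];
  for (const child of tree.slice(1)) {
    const uri = rootUri(child);
    if (uri !== null) return uri;
  }
  return null;
}
```

Applied to the ``bluebird`` portal above, ``rootUri(parsePortal(...))`` gives ``https://www.npmjs.com/package/bluebird``, which could be used, for example, to group summaries by package.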