Commit eead7f6

Merge pull request #1610 from xiemaisi/js/library-customizations
JavaScript: Start documenting extension points provided by the standard library.
2 parents f3f89ff + 50b1ddf commit eead7f6

1 file changed

+256
-0
lines changed

@@ -0,0 +1,256 @@
1+
Customizing the JavaScript analysis
2+
===================================
3+
4+
This document describes the main extension points offered by the JavaScript analysis for customizing
5+
analysis behavior without editing the queries or libraries themselves.
6+
7+
Customization mechanisms
8+
------------------------
9+
10+
The two mechanisms used for customization are subclassing and overriding.
11+
12+
We can teach the JavaScript analysis to handle further instances of abstract concepts it already
13+
understands by subclassing abstract classes and implementing their member predicates. For example,
14+
the standard library defines an abstract class ``SystemCommandExecution`` that covers various APIs
15+
for executing operating-system commands. This class is used by the command-injection analysis to
16+
identify problematic flows where input from a potentially malicious user is interpreted as the name
17+
of a system command to execute. By defining additional subclasses of ``SystemCommandExecution``, we
18+
can make this analysis more powerful without touching its implementation.
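
Such a subclass might look like the following minimal sketch. The package ``my-exec`` and its function ``runCommand`` are hypothetical, and the sketch assumes that ``getACommandArgument`` is the member predicate to implement; depending on the library version, ``SystemCommandExecution`` may declare further abstract predicates that also need an implementation.

.. code-block:: ql

   import javascript

   /** A call to `runCommand` from the hypothetical package `my-exec`. */
   class MyExecCall extends SystemCommandExecution, DataFlow::CallNode {
     MyExecCall() { this = DataFlow::moduleMember("my-exec", "runCommand").getACall() }

     /** Treats the first argument as the command to run (an assumption about `my-exec`). */
     override DataFlow::Node getACommandArgument() { result = this.getArgument(0) }
   }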
19+
20+
By overriding a member predicate defined in the library, we can change its behavior either for all
21+
its receivers or only a subset. For example, the standard library predicate
22+
``ControlFlowNode::getASuccessor`` implements the basic control-flow graph on which many further
23+
analyses are based. By overriding it, we can add, suppress, or modify control-flow graph edges.
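
As an illustration, the following sketch suppresses the outgoing edges of calls that syntactically look like ``process.exit(...)``, on the assumption that such calls never return. The class name is hypothetical, and the characteristic predicate deliberately uses only AST-level classes so that the control-flow layer does not depend on a higher layer.

.. code-block:: ql

   import javascript

   /** A call of the form `process.exit(...)`, recognized purely syntactically. */
   class ProcessExitCall extends ControlFlowNode {
     ProcessExitCall() {
       exists(PropAccess callee | callee = this.(CallExpr).getCallee() |
         callee.getBase().(GlobalVarAccess).getName() = "process" and
         callee.getPropertyName() = "exit"
       )
     }

     /** Removes all successors of this node, since control does not continue past it. */
     override ControlFlowNode getASuccessor() { none() }
   }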
24+
25+
Once a customization has been defined, it needs to be brought into scope so that the default
26+
analysis queries pick it up. This can be done by adding the customizing definitions to
27+
``Customizations.qll``, an initially empty library file that is imported by the default library
28+
``javascript.qll``.
29+
30+
Sometimes you may want to perform both kinds of customizations at the same time. That is, subclass a base
31+
class to provide new implementations of an API, and override some member predicates of the same base
32+
class to selectively change the implementation of the API. This is not always easy to do, since the
33+
former requires the base class to be abstract, while the latter requires it to be concrete.
34+
35+
To work around this, the JavaScript library uses the so-called *range pattern*. In this pattern, the base class
36+
``Base`` itself is concrete, but it has an abstract companion class called ``Base::Range`` covering
37+
the same set of values. To change the implementation of the API, subclass ``Base`` and override its
38+
member predicates. To provide new implementations of the API, subclass ``Base::Range`` and implement
39+
its abstract member predicates.
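
The shape of the pattern can be sketched as follows. This is not code from the library: ``Base`` stands for any class using the pattern, and the member predicate ``getName`` is a hypothetical placeholder for its API.

.. code-block:: ql

   import javascript

   /** The concrete class: subclass it to change the behavior of existing instances. */
   class Base extends DataFlow::Node {
     Base::Range range;

     Base() { this = range }

     /** A placeholder API predicate that delegates to the companion class. */
     string getName() { result = range.getName() }
   }

   module Base {
     /** The abstract companion class: subclass it to add new instances. */
     abstract class Range extends DataFlow::Node {
       abstract string getName();
     }
   }

Because every value of ``Base`` is also a value of ``Base::Range``, the two classes cover the same set of values, as described above.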
40+
41+
For example, the class ``Base64::Encode`` in the standard library models base64-encoding libraries
42+
using the range pattern. It comes with subclasses corresponding to many popular base64 encoders. To
43+
add support for a new library, subclass ``Base64::Encode::Range`` and implement the member
44+
predicates ``getInput`` and ``getOutput``. To customize the definition of ``getInput`` or
45+
``getOutput`` for a library that is already supported, extend ``Base64::Encode`` itself and override
46+
the predicate you want to customize.
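
For instance, support for a new encoder could be sketched as follows, assuming a hypothetical package ``my-base64`` whose ``encode`` function takes the data as its first argument and returns the encoded string:

.. code-block:: ql

   import javascript

   /** A call to `encode` from the hypothetical package `my-base64`. */
   class MyBase64Encode extends Base64::Encode::Range, DataFlow::CallNode {
     MyBase64Encode() { this = DataFlow::moduleMember("my-base64", "encode").getACall() }

     /** The data to encode is assumed to be the first argument. */
     override DataFlow::Node getInput() { result = this.getArgument(0) }

     /** The encoded data is assumed to be the result of the call. */
     override DataFlow::Node getOutput() { result = this }
   }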
47+
48+
Note that the range pattern is not yet used everywhere, so you will find some abstract
49+
classes without a concrete companion. We are planning on eventually migrating most abstract classes
50+
to use the range pattern.
51+
52+
Analysis layers
53+
---------------
54+
55+
The JavaScript analysis libraries have a layered structure with higher-level analyses based on
56+
lower-level ones. To avoid performance problems and non-monotonic recursion, classes and predicates
57+
in a lower layer should usually not depend on a higher layer.
58+
59+
In this section, we briefly introduce the most important analysis layers, starting from the lowest
60+
layer. Below, we discuss the extension points offered by the individual layers.
61+
62+
Abstract syntax tree
63+
~~~~~~~~~~~~~~~~~~~~
64+
65+
The abstract syntax tree (AST), implemented by class ``ASTNode`` and its subclasses, is the lowest layer
66+
and is a good representation of the information stored in the snapshot database. It
67+
corresponds closely to the syntactic structure of the program, only abstracting away from
68+
typographical details such as whitespace and indentation.
69+
70+
Control-flow graph
71+
~~~~~~~~~~~~~~~~~~
72+
73+
The (intra-procedural) control-flow graph (CFG), implemented by class ``ControlFlowNode`` and its
74+
subclasses, is the next layer. It models flow of control inside functions and top-level scripts. The
75+
CFG is overlaid on top of the AST, meaning that each AST node has a corresponding CFG node. There
76+
are also synthetic CFG nodes that do not correspond to an AST node. For example, entry and exit
77+
nodes (``ControlFlowEntryNode`` and ``ControlFlowExitNode``) mark the beginning and end,
78+
respectively, of the execution of a function or top-level script, while guard nodes
79+
(``GuardControlFlowNode``) record that some condition is known to hold at some point in the program.
80+
81+
Basic blocks (class ``BasicBlock``) organize control-flow nodes into maximal sequences of
82+
straight-line code, which is vital for efficiently reasoning about control flow.
83+
84+
Static single-assignment form
85+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
86+
87+
The static single-assignment (SSA) representation (classes ``SsaVariable`` and ``SsaDefinition``) uses
88+
control-flow information to split up local variables into SSA variables that each only have a single
89+
definition.
90+
91+
In addition to regular definitions corresponding to assignments and increment/decrement expressions,
92+
the SSA form also introduces pseudo-definitions such as
93+
94+
- *phi nodes*, where multiple possible values for a variable are merged
95+
- *refinement nodes* (also known as *pi nodes*) marking program points where additional information about a variable becomes available that may restrict its possible set of values.
96+
97+
Local data flow
98+
~~~~~~~~~~~~~~~
99+
100+
The (intra-procedural) data-flow graph, implemented by class ``DataFlow::Node`` and its subclasses,
101+
represents the flow of data within a function or top-level script. Each expression has a
102+
corresponding data-flow node. Additionally, there are data-flow nodes that do not correspond to
103+
syntactic elements. For example, each SSA variable has a corresponding data-flow node. Note that
104+
flow between functions (through arguments and return values) is not modeled in this layer, except
105+
for the special case of immediately-invoked function expressions. Flow through object properties is
106+
also not modeled.
107+
108+
This layer also implements the widely-used source-node API. The class ``DataFlow::SourceNode`` and its
109+
subclasses represent data-flow nodes where new objects are created (such as object expressions), or
110+
where non-local data flow enters the intra-procedural data-flow graph (such as function parameters
111+
or property reads). The source-node API provides convenient predicates for reasoning about these
112+
nodes without having to explicitly encode data-flow graph traversal.
113+
114+
Type inference
115+
~~~~~~~~~~~~~~
116+
117+
Class ``AnalyzedNode`` and its subclasses implement (intra-procedural) type inference on top of the
118+
local data-flow graph. Some reasoning about properties is implemented as well, but more advanced
119+
features such as the prototype chain are not considered.
120+
121+
Call graph
122+
~~~~~~~~~~
123+
124+
The call graph is implemented as a predicate ``getACallee`` on ``DataFlow::InvokeNode``, the class
125+
of data-flow nodes representing function calls (with or without ``new``). It uses local data flow and
126+
type information, as well as type annotations where available.
127+
128+
Type tracking
129+
~~~~~~~~~~~~~
130+
131+
The type-tracking framework (classes ``DataFlow::TypeTracker`` and ``DataFlow::TypeBackTracker``) is
132+
a library for implementing custom type inference systems that track values inter-procedurally,
133+
including tracking through one level of object properties.
134+
135+
Framework models
136+
~~~~~~~~~~~~~~~~
137+
138+
The libraries under ``semmle/javascript/frameworks`` model a broad range of popular JavaScript
139+
libraries and frameworks, such as Express and Vue.js. Some framework modeling libraries are located
140+
under ``semmle/javascript`` directly, for instance ``Base64``, ``EmailClients``, and ``JsonParsers``.
141+
142+
Global data flow and taint tracking
143+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
144+
145+
The inter-procedural data flow and taint-tracking libraries can be used to implement static
146+
information-flow analyses. Most of our security queries are based on this approach.
147+
148+
Extension points
149+
----------------
150+
151+
In this section, we discuss the most important extension points for the individual analysis layers introduced
152+
above.
153+
154+
AST
155+
~~~
156+
157+
This layer should not normally be customized. It is technically possible to override, for instance,
158+
``ASTNode.getChild`` to change the way the AST structure is represented, but this should normally be
159+
avoided in the interest of keeping a close correspondence between AST and concrete syntax.
160+
161+
CFG
162+
~~~
163+
164+
You can override ``ControlFlowNode.getASuccessor`` to customize the control-flow graph. Note that
165+
overriding ``ControlFlowNode.getAPredecessor`` is not normally useful, since it is rarely used in
166+
higher-level libraries.
167+
168+
SSA
169+
~~~
170+
171+
It is not normally necessary to customize this layer.
172+
173+
Local data flow
174+
~~~~~~~~~~~~~~~
175+
176+
The ``DataFlow::SourceNode`` class uses the range pattern, so new kinds of source nodes can be
177+
added by extending ``DataFlow::SourceNode::Range``. Some of its subclasses can similarly be
178+
extended. For example, ``DataFlow::ModuleImportNode`` models module imports, and ``DataFlow::ClassNode`` models
179+
class definitions. The former provides default implementations covering CommonJS, AMD, and ECMAScript
180+
2015 modules, while the latter handles ECMAScript 2015 classes, as well as traditional function-based
181+
classes. You can extend their corresponding ``::Range`` classes to add support for other module or
182+
class systems.
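
As a sketch, a custom module system with a hypothetical global loader function ``loadModule`` could be modeled like this, assuming that ``getPath`` is the abstract member predicate of ``DataFlow::ModuleImportNode::Range``:

.. code-block:: ql

   import javascript

   /** A call `loadModule("path")` to a hypothetical global module loader. */
   class LoadModuleCall extends DataFlow::ModuleImportNode::Range, DataFlow::CallNode {
     LoadModuleCall() { this = DataFlow::globalVarRef("loadModule").getACall() }

     /** The import path is the constant string passed as the first argument, if any. */
     override string getPath() { this.getArgument(0).mayHaveStringValue(result) }
   }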
183+
184+
Type inference
185+
~~~~~~~~~~~~~~
186+
187+
You can override ``AnalyzedNode::getAValue`` to customize the type inference. Note that the type
188+
inference is expected to be sound: as far as practical, the abstract values inferred for a
189+
data-flow node should cover all possible concrete values this node may take on at runtime.
190+
191+
You can also extend the set of abstract values. To add individual abstract values that are
192+
independent of the program being analyzed, define a subclass of ``CustomAbstractValueTag``
193+
describing the new abstract value. There will then be a corresponding value of class
194+
``CustomAbstractValue`` that you can use in overriding definitions of the ``getAValue`` predicate.
195+
196+
Call graph
197+
~~~~~~~~~~
198+
199+
You can override ``DataFlow::InvokeNode::getACallee(int)`` to customize the call graph. Note that
200+
overriding the zero-argument version ``getACallee()`` is not enough, since higher layers use the
201+
one-argument version.
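
The following sketch shows the general shape of such an override. It assumes a hypothetical global helper ``dispatch("f", ...)`` that invokes the function named ``f``, and it assumes that an imprecision of ``0`` denotes a precise callee; both are illustrative assumptions rather than facts about any real framework.

.. code-block:: ql

   import javascript

   /** A call `dispatch("f", ...)` to a hypothetical global dispatch helper. */
   class DispatchCall extends DataFlow::CallNode {
     DispatchCall() { this = DataFlow::globalVarRef("dispatch").getACall() }

     override Function getACallee(int imprecision) {
       // keep the callees computed by the standard library
       result = super.getACallee(imprecision)
       or
       // additionally resolve the call to functions in the same file with the given name
       imprecision = 0 and
       this.getArgument(0).mayHaveStringValue(result.getName()) and
       result.getFile() = this.getFile()
     }
   }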
202+
203+
Type tracking
204+
~~~~~~~~~~~~~
205+
206+
It is not normally necessary to customize this layer.
207+
208+
Framework models
209+
~~~~~~~~~~~~~~~~
210+
211+
The ``semmle.javascript.frameworks.HTTP`` module defines many abstract classes that can be extended
212+
to implement support for new web server frameworks. These classes, in turn, are used by some of the
213+
security queries (such as the reflected cross-site scripting query) to define sources and sinks, so
214+
these queries will automatically benefit from the additional modeling.
215+
216+
Similarly, the ``semmle.javascript.frameworks.ClientRequests`` module defines an abstract class for
217+
modeling client-side HTTP requests. It comes with built-in support for a number of popular
218+
frameworks, and you can add support for new frameworks by extending the abstract class.
219+
220+
The ``semmle.javascript.frameworks.SQL`` module defines abstract classes for modeling SQL
221+
connector libraries, and the ``semmle.javascript.JsonParsers`` and
222+
``semmle.javascript.frameworks.XML`` modules define abstract classes for modeling JSON and XML parsers, respectively.
223+
224+
The ``semmle.javascript.Concepts`` module defines a small number of broad concepts such as system-command
225+
executions or file-system accesses, which are concretely instantiated in some of the existing
226+
framework libraries, but can of course be further extended to model additional frameworks.
227+
228+
Global data flow and taint tracking
229+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
230+
231+
Most security queries consist of:
232+
233+
- one QL file defining the query
234+
- one configuration module defining the taint-tracking configuration
235+
- one customization module defining sources, sinks, and sanitizers
236+
237+
For example, ``Security/CWE-078/CommandInjection.ql`` defines the command-injection query. It
238+
imports the module ``semmle.javascript.security.dataflow.CommandInjection``, which defines the
239+
configuration class ``CommandInjection::Configuration``. This module in turn imports
240+
``semmle.javascript.security.dataflow.CommandInjectionCustomizations``, which defines three abstract
241+
classes (``CommandInjection::Source``, ``CommandInjection::Sink``, and
242+
``CommandInjection::Sanitizer``) that model sources, sinks, and sanitizers, respectively.
243+
244+
To define additional sources, sinks or sanitizers for this or any other security query, import the
245+
customization module and extend these abstract classes with additional subclasses.
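
For the command-injection query, an additional source could be sketched as follows. Treating environment variables as attacker-controlled is an assumption made only for this illustration, and the class name is hypothetical.

.. code-block:: ql

   import javascript
   import semmle.javascript.security.dataflow.CommandInjectionCustomizations

   /** A read of a property of `process.env`, treated here as a command-injection source. */
   class EnvironmentVariableAsSource extends CommandInjection::Source {
     EnvironmentVariableAsSource() {
       this = DataFlow::globalVarRef("process").getAPropertyRead("env").getAPropertyRead()
     }
   }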
246+
247+
Note that for performance reasons you should normally only import the configuration module from a QL
248+
file. Importing it into the standard library (for example by importing it in ``Customizations.qll``)
249+
will slow down all the other security queries, since the configuration class will now always be in
250+
scope and flow from its sources to sinks will be tracked in addition to all the other configuration
251+
classes.
252+
253+
Another useful extension point is the class ``RemoteFlowSource``, which is used as a source by most
254+
queries looking for injection vulnerabilities (such as SQL injection or cross-site scripting). By
255+
extending it with new subclasses modeling other sources of user-controlled input, you can
256+
simultaneously improve all of these queries.
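
A new remote flow source might be sketched like this, assuming a hypothetical package ``my-rpc`` whose ``readRequest`` function returns data received from a client, and assuming that ``getSourceType`` is the member predicate that describes the source:

.. code-block:: ql

   import javascript

   /** The result of a call to `readRequest` from the hypothetical package `my-rpc`. */
   class MyRpcRequestSource extends RemoteFlowSource {
     MyRpcRequestSource() { this = DataFlow::moduleMember("my-rpc", "readRequest").getACall() }

     /** A short description of this kind of source, for use in alert messages. */
     override string getSourceType() { result = "my-rpc request data" }
   }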
