|
| 1 | +Customizing the JavaScript analysis |
| 2 | +=================================== |
| 3 | + |
| 4 | +This document describes the main extension points offered by the JavaScript analysis for customizing |
| 5 | +analysis behavior without editing the queries or libraries themselves. |
| 6 | + |
| 7 | +Customization mechanisms |
| 8 | +------------------------ |
| 9 | + |
| 10 | +The two mechanisms used for customization are subclassing and overriding. |
| 11 | + |
| 12 | +We can teach the JavaScript analysis to handle further instances of abstract concepts it already |
| 13 | +understands by subclassing abstract classes and implementing their member predicates. For example, |
| 14 | +the standard library defines an abstract class ``SystemCommandExecution`` that covers various APIs |
| 15 | +for executing operating-system commands. This class is used by the command-injection analysis to |
| 16 | +identify problematic flows where input from a potentially malicious user is interpreted as the name |
| 17 | +of a system command to execute. By defining additional subclasses of ``SystemCommandExecution``, we |
| 18 | +can make this analysis more powerful without touching its implementation. |
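| | + |
| | +As a minimal sketch of such a subclass, the following class treats calls to a hypothetical function |
| | +``runCommand`` from a hypothetical package ``my-exec`` as system command executions. It assumes that |
| | +``getACommandArgument`` is the only abstract member predicate of ``SystemCommandExecution``; depending |
| | +on the library version, further member predicates may need to be implemented. |
| | + |
| | +.. code-block:: ql |
| | + |
| | +   import javascript |
| | + |
| | +   /** |
| | +    * A call to `runCommand` from the (hypothetical) package `my-exec`, |
| | +    * viewed as a system command execution. |
| | +    */ |
| | +   class MyExecCall extends SystemCommandExecution, DataFlow::CallNode { |
| | +     MyExecCall() { this = DataFlow::moduleMember("my-exec", "runCommand").getACall() } |
| | + |
| | +     /** Assumption: the first argument is the command to execute. */ |
| | +     override DataFlow::Node getACommandArgument() { result = this.getArgument(0) } |
| | +   } |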
| 19 | + |
| 20 | +By overriding a member predicate defined in the library, we can change its behavior either for all |
| 21 | +its receivers or only a subset. For example, the standard library predicate |
| 22 | +``ControlFlowNode.getASuccessor`` implements the basic control-flow graph on which many further |
| 23 | +analyses are based. By overriding it, we can add, suppress, or modify control-flow graph edges. |
| 24 | + |
| 25 | +Once a customization has been defined, it needs to be brought into scope so that the default |
| 26 | +analysis queries pick it up. This can be done by adding the customizing definitions to |
| 27 | +``Customizations.qll``, an initially empty library file that is imported by the default library |
| 28 | +``javascript.qll``. |
| 29 | + |
| 30 | +Sometimes you may want to perform both kinds of customizations at the same time. That is, subclass a base |
| 31 | +class to provide new implementations of an API, and override some member predicates of the same base |
| 32 | +class to selectively change the implementation of the API. This is not always easy to do, since the |
| 33 | +former requires the base class to be abstract, while the latter requires it to be concrete. |
| 34 | + |
| 35 | +To work around this, the JavaScript library uses the so-called *range pattern*. In this pattern, the base class |
| 36 | +``Base`` itself is concrete, but it has an abstract companion class called ``Base::Range`` covering |
| 37 | +the same set of values. To change the implementation of the API, subclass ``Base`` and override its |
| 38 | +member predicates. To provide new implementations of the API, subclass ``Base::Range`` and implement |
| 39 | +its abstract member predicates. |
| 40 | + |
| 41 | +For example, the class ``Base64::Encode`` in the standard library models base64-encoding libraries |
| 42 | +using the range pattern. It comes with subclasses corresponding to many popular base64 encoders. To |
| 43 | +add support for a new library, subclass ``Base64::Encode::Range`` and implement the member |
| 44 | +predicates ``getInput`` and ``getOutput``. To customize the definition of ``getInput`` or |
| 45 | +``getOutput`` for a library that is already supported, extend ``Base64::Encode`` itself and override |
| 46 | +the predicate you want to customize. |
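| | + |
| | +The first case might look roughly as follows. This is only a sketch: the package ``my-base64`` and its |
| | +``encode`` function are hypothetical, and the argument position is an assumption made for illustration. |
| | + |
| | +.. code-block:: ql |
| | + |
| | +   import javascript |
| | + |
| | +   /** A call to `encode` from the (hypothetical) package `my-base64`. */ |
| | +   class MyBase64Encode extends Base64::Encode::Range, DataFlow::CallNode { |
| | +     MyBase64Encode() { this = DataFlow::moduleMember("my-base64", "encode").getACall() } |
| | + |
| | +     /** Assumption: the value to encode is passed as the first argument. */ |
| | +     override DataFlow::Node getInput() { result = this.getArgument(0) } |
| | + |
| | +     /** The encoded value is the result of the call itself. */ |
| | +     override DataFlow::Node getOutput() { result = this } |
| | +   } |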
| 47 | + |
| 48 | +Note that the range pattern is not yet used everywhere, so you will find some abstract |
| 49 | +classes without a concrete companion. We are planning on eventually migrating most abstract classes |
| 50 | +to use the range pattern. |
| 51 | + |
| 52 | +Analysis layers |
| 53 | +--------------- |
| 54 | + |
| 55 | +The JavaScript analysis libraries have a layered structure with higher-level analyses based on |
| 56 | +lower-level ones. To avoid performance problems and non-monotonic recursion, classes and predicates |
| 57 | +in a lower layer should not normally depend on a higher layer. |
| 58 | + |
| 59 | +In this section, we briefly introduce the most important analysis layers, starting from the lowest |
| 60 | +layer. Below, we discuss the extension points offered by the individual layers. |
| 61 | + |
| 62 | +Abstract syntax tree |
| 63 | +~~~~~~~~~~~~~~~~~~~~ |
| 64 | + |
| 65 | +The abstract syntax tree (AST), implemented by class ``ASTNode`` and its subclasses, is the lowest layer |
| 66 | +and is a good representation of the information stored in the snapshot database. It |
| 67 | +corresponds closely to the syntactic structure of the program, only abstracting away from |
| 68 | +typographical details such as whitespace and indentation. |
| 69 | + |
| 70 | +Control-flow graph |
| 71 | +~~~~~~~~~~~~~~~~~~ |
| 72 | + |
| 73 | +The (intra-procedural) control-flow graph (CFG), implemented by class ``ControlFlowNode`` and its |
| 74 | +subclasses, is the next layer. It models flow of control inside functions and top-level scripts. The |
| 75 | +CFG is overlaid on top of the AST, meaning that each AST node has a corresponding CFG node. There |
| 76 | +are also synthetic CFG nodes that do not correspond to an AST node. For example, entry and exit |
| 77 | +nodes (``ControlFlowEntryNode`` and ``ControlFlowExitNode``) mark the beginning and end, |
| 78 | +respectively, of the execution of a function or top-level script, while guard nodes |
| 79 | +(``GuardControlFlowNode``) record that some condition is known to hold at some point in the program. |
| 80 | + |
| 81 | +Basic blocks (class ``BasicBlock``) organize control-flow nodes into maximal sequences of |
| 82 | +straight-line code, which is vital for efficiently reasoning about control flow. |
| 83 | + |
| 84 | +Static single-assignment form |
| 85 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 86 | + |
| 87 | +The static single-assignment (SSA) representation (classes ``SsaVariable`` and ``SsaDefinition``) uses |
| 88 | +control-flow information to split up local variables into SSA variables that each only have a single |
| 89 | +definition. |
| 90 | + |
| 91 | +In addition to regular definitions corresponding to assignments and increment/decrement expressions, |
| 92 | +the SSA form also introduces pseudo-definitions such as: |
| 93 | + |
| 94 | + - *phi nodes*, where multiple possible values for a variable are merged |
| 95 | + - *refinement nodes* (also known as *pi nodes*), which mark program points where additional information about a variable becomes available that may restrict its possible set of values |
| 96 | + |
| 97 | +Local data flow |
| 98 | +~~~~~~~~~~~~~~~ |
| 99 | + |
| 100 | +The (intra-procedural) data-flow graph, implemented by class ``DataFlow::Node`` and its subclasses, |
| 101 | +represents the flow of data within a function or top-level script. Each expression has a |
| 102 | +corresponding data-flow node. Additionally, there are data-flow nodes that do not correspond to |
| 103 | +syntactic elements. For example, each SSA variable has a corresponding data-flow node. Note that |
| 104 | +flow between functions (through arguments and return values) is not modeled in this layer, except |
| 105 | +for the special case of immediately-invoked function expressions. Flow through object properties is |
| 106 | +also not modeled. |
| 107 | + |
| 108 | +This layer also implements the widely-used source-node API. The class ``DataFlow::SourceNode`` and its |
| 109 | +subclasses represent data-flow nodes where new objects are created (such as object expressions), or |
| 110 | +where non-local data flow enters the intra-procedural data-flow graph (such as function parameters |
| 111 | +or property reads). The source-node API provides convenient predicates for reasoning about these |
| 112 | +nodes without having to explicitly encode data-flow graph traversal. |
| 113 | + |
| 114 | +Type inference |
| 115 | +~~~~~~~~~~~~~~ |
| 116 | + |
| 117 | +Class ``AnalyzedNode`` and its subclasses implement (intra-procedural) type inference on top of the |
| 118 | +local data-flow graph. Some reasoning about properties is implemented as well, but more advanced |
| 119 | +features such as the prototype chain are not considered. |
| 120 | + |
| 121 | +Call graph |
| 122 | +~~~~~~~~~~ |
| 123 | + |
| 124 | +The call graph is implemented as a predicate ``getACallee`` on ``DataFlow::InvokeNode``, the class |
| 125 | +of data-flow nodes representing function calls (with or without ``new``). It uses local data flow and |
| 126 | +type information, as well as type annotations where available. |
| 127 | + |
| 128 | +Type tracking |
| 129 | +~~~~~~~~~~~~~ |
| 130 | + |
| 131 | +The type-tracking framework (classes ``DataFlow::TypeTracker`` and ``DataFlow::TypeBackTracker``) is |
| 132 | +a library for implementing custom type inference systems that track values inter-procedurally, |
| 133 | +including tracking through one level of object properties. |
| 134 | + |
| 135 | +Framework models |
| 136 | +~~~~~~~~~~~~~~~~ |
| 137 | + |
| 138 | +The libraries under ``semmle/javascript/frameworks`` model a broad range of popular JavaScript |
| 139 | +libraries and frameworks, such as Express and Vue.js. Some framework modeling libraries are located |
| 140 | +under ``semmle/javascript`` directly, for instance ``Base64``, ``EmailClients``, and ``JsonParsers``. |
| 141 | + |
| 142 | +Global data flow and taint tracking |
| 143 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 144 | + |
| 145 | +The inter-procedural data flow and taint-tracking libraries can be used to implement static |
| 146 | +information-flow analyses. Most of our security queries are based on this approach. |
| 147 | + |
| 148 | +Extension points |
| 149 | +---------------- |
| 150 | + |
| 151 | +In this section, we discuss the most important extension points for the individual analysis layers introduced |
| 152 | +above. |
| 153 | + |
| 154 | +AST |
| 155 | +~~~ |
| 156 | + |
| 157 | +This layer should not normally be customized. It is technically possible to override, for instance, |
| 158 | +``ASTNode.getChild`` to change the way the AST structure is represented, but this should normally be |
| 159 | +avoided in the interest of keeping a close correspondence between AST and concrete syntax. |
| 160 | + |
| 161 | +CFG |
| 162 | +~~~ |
| 163 | + |
| 164 | +You can override ``ControlFlowNode.getASuccessor`` to customize the control-flow graph. Note that |
| 165 | +overriding ``ControlFlowNode.getAPredecessor`` is not normally useful, since it is rarely used in |
| 166 | +higher-level libraries. |
| 167 | + |
| 168 | +SSA |
| 169 | +~~~ |
| 170 | + |
| 171 | +It is not normally necessary to customize this layer. |
| 172 | + |
| 173 | +Local data flow |
| 174 | +~~~~~~~~~~~~~~~ |
| 175 | + |
| 176 | +The ``DataFlow::SourceNode`` class uses the range pattern, so new kinds of source nodes can be |
| 177 | +added by extending ``DataFlow::SourceNode::Range``. Some of its subclasses can similarly be |
| 178 | +extended. For example, ``DataFlow::ModuleImportNode`` models module imports, and ``DataFlow::ClassNode`` models |
| 179 | +class definitions. The former provides default implementations covering CommonJS, AMD, and ECMAScript |
| 180 | +2015 modules, while the latter handles ECMAScript 2015 classes, as well as traditional function-based |
| 181 | +classes. You can extend their corresponding ``::Range`` classes to add support for other module or |
| 182 | +class systems. |
| 183 | + |
| 184 | +Type inference |
| 185 | +~~~~~~~~~~~~~~ |
| 186 | + |
| 187 | +You can override ``AnalyzedNode.getAValue`` to customize the type inference. Note that the type |
| 188 | +inference is expected to be sound: as far as is practical, the abstract values inferred for a |
| 189 | +data-flow node should cover all possible concrete values this node may take on at runtime. |
| 190 | + |
| 191 | +You can also extend the set of abstract values. To add individual abstract values that are |
| 192 | +independent of the program being analyzed, define a subclass of ``CustomAbstractValueTag`` |
| 193 | +describing the new abstract value. There will then be a corresponding value of class |
| 194 | +``CustomAbstractValue`` that you can use in overriding definitions of the ``getAValue`` predicate. |
| 195 | + |
| 196 | +Call graph |
| 197 | +~~~~~~~~~~ |
| 198 | + |
| 199 | +You can override ``DataFlow::InvokeNode.getACallee(int)`` to customize the call graph. Note that |
| 200 | +overriding the zero-argument version ``getACallee()`` is not enough, since higher layers use the |
| 201 | +one-argument version. |
| 202 | + |
| 203 | +Type tracking |
| 204 | +~~~~~~~~~~~~~ |
| 205 | + |
| 206 | +It is not normally necessary to customize this layer. |
| 207 | + |
| 208 | +Framework models |
| 209 | +~~~~~~~~~~~~~~~~ |
| 210 | + |
| 211 | +The ``semmle.javascript.frameworks.HTTP`` module defines many abstract classes that can be extended |
| 212 | +to implement support for new web server frameworks. These classes, in turn, are used by some of the |
| 213 | +security queries (such as the reflected cross-site scripting query) to define sources and sinks, so |
| 214 | +these queries will automatically benefit from the additional modeling. |
| 215 | + |
| 216 | +Similarly, the ``semmle.javascript.frameworks.ClientRequests`` module defines an abstract class for |
| 217 | +modeling client-side HTTP requests. It comes with built-in support for a number of popular |
| 218 | +frameworks, and you can add support for new frameworks by extending the abstract class. |
| 219 | + |
| 220 | +The ``semmle.javascript.frameworks.SQL`` module defines abstract classes for modeling SQL |
| 221 | +connector libraries, and the ``semmle.javascript.JsonParsers`` and |
| 222 | +``semmle.javascript.frameworks.XML`` modules do the same for JSON and XML parsers, respectively. |
| 223 | + |
| 224 | +The ``semmle.javascript.Concepts`` module defines a small number of broad concepts such as system-command |
| 225 | +executions or file-system accesses, which are concretely instantiated in some of the existing |
| 226 | +framework libraries, but can of course be further extended to model additional frameworks. |
| 227 | + |
| 228 | +Global data flow and taint tracking |
| 229 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 230 | + |
| 231 | +Most security queries consist of: |
| 232 | + |
| 233 | + - one QL file defining the query |
| 234 | + - one configuration module defining the taint-tracking configuration |
| 235 | + - one customization module defining sources, sinks, and sanitizers |
| 236 | + |
| 237 | +For example, ``Security/CWE-078/CommandInjection.ql`` defines the command-injection query. It |
| 238 | +imports the module ``semmle.javascript.security.dataflow.CommandInjection``, which defines the |
| 239 | +configuration class ``CommandInjection::Configuration``. This module in turn imports |
| 240 | +``semmle.javascript.security.dataflow.CommandInjectionCustomizations``, which defines three abstract |
| 241 | +classes (``CommandInjection::Source``, ``CommandInjection::Sink``, and |
| 242 | +``CommandInjection::Sanitizer``) that model sources, sinks, and sanitizers, respectively. |
| 243 | + |
| 244 | +To define additional sources, sinks, or sanitizers for this or any other security query, import the |
| 245 | +customization module and extend these abstract classes with additional subclasses. |
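| | + |
| | +For the command-injection query, an additional sink could be sketched as follows. The package ``my-exec`` |
| | +and its function ``runUnsafe`` are hypothetical names chosen for illustration, as is the assumption that |
| | +the first argument is the command being executed. |
| | + |
| | +.. code-block:: ql |
| | + |
| | +   import javascript |
| | +   import semmle.javascript.security.dataflow.CommandInjectionCustomizations |
| | + |
| | +   /** |
| | +    * The first argument of a call to `runUnsafe` from the (hypothetical) |
| | +    * package `my-exec`, viewed as a command-injection sink. |
| | +    */ |
| | +   class MyCommandInjectionSink extends CommandInjection::Sink { |
| | +     MyCommandInjectionSink() { |
| | +       this = DataFlow::moduleMember("my-exec", "runUnsafe").getACall().getArgument(0) |
| | +     } |
| | +   } |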
| 246 | + |
| 247 | +Note that for performance reasons you should normally only import the configuration module from a QL |
| 248 | +file. Importing it into the standard library (for example by importing it in ``Customizations.qll``) |
| 249 | +will slow down all the other security queries, since the configuration class will then always be in |
| 250 | +scope, and flow from its sources to its sinks will be tracked in addition to the flow for all other |
| 251 | +configuration classes. |
| 252 | + |
| 253 | +Another useful extension point is the class ``RemoteFlowSource``, which is used as a source by most |
| 254 | +queries looking for injection vulnerabilities (such as SQL injection or cross-site scripting). By |
| 255 | +extending it with new subclasses modeling other sources of user-controlled input, you can |
| 256 | +simultaneously improve all of these queries. |
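| | + |
| | +A minimal sketch of such a subclass is shown below. The package ``my-framework`` and its function |
| | +``getUserInput`` are hypothetical, and the sketch assumes that ``RemoteFlowSource`` is declared in |
| | +``semmle.javascript.security.dataflow.RemoteFlowSources`` and that ``getSourceType`` is its only abstract |
| | +member predicate, which may differ between library versions. |
| | + |
| | +.. code-block:: ql |
| | + |
| | +   import javascript |
| | +   import semmle.javascript.security.dataflow.RemoteFlowSources |
| | + |
| | +   /** A call to `getUserInput` from the (hypothetical) package `my-framework`. */ |
| | +   class MyFrameworkInput extends RemoteFlowSource { |
| | +     MyFrameworkInput() { |
| | +       this = DataFlow::moduleMember("my-framework", "getUserInput").getACall() |
| | +     } |
| | + |
| | +     /** A short description of this kind of source, used in alert messages. */ |
| | +     override string getSourceType() { result = "my-framework user input" } |
| | +   } |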