Usage

After the ml-sast prototype has been installed, following the steps outlined in the Installation guide, this guide provides the necessary steps to analyze your own source code for potential security vulnerabilities.

Project directory

The project directory needs to contain the following four items:

A directory code, which contains the source code that should be analyzed
A build script build.sh - that defines how the source code is compiled
A main function, that calls all the exposed APIs of the source code
The file config.yaml, that define the analysis steps and analysis parameters

An example of this setup is given by the example project to provide you a starting point to define and configure the analysis steps to your needs.

Code - Directory

This directory contains the source code that should be analyzed.

build.sh

This scripts defines how the source code is compiled. As the build system is based on Ubuntu 20.04, needed dependencies for the compilation of the source code can be installed with the help of apt or apt-get.

config.yaml

This is the main configuration file for the tool that must be provided with every software project to be analyzed. It configures the individual analysis phases and the order in which the tool is supposed to execute them. At the top level, the config.yaml is a dictionary, where key corresponds to a phase, with the exception of the first key: general. All other phases use their exact class names that are defined by the python modules found in mlsast/logic/steps. Internally the tool maps these names onto the corresponding classes and executes them in the order (top to bottom) that the config.yaml-file dictates. Listed below are all phases that the tool is shipped with and their configuration parameters.

general

key	type	usage
project	string	The name of the project to analyze (merely for documentation purposes).
source	string	File path to the source code to be analyzed.
loglevel	string	The log level (DEBUG or INFO or WARNING or ERROR).

Example

general:
  project: "example"  
  source: ./code  
  loglevel: DEBUG

svf

key	type	usage
binray	string	The binary that results from the compilation defined in the build.sh script.
delete	bool	If true, the docker container of this phase will be removed after execution.
graphs	list	The internal graph representations that SVF must build for the analysis:
↳ tldg	string	The top level dependency graph (LLVM-IR SSA w/o points-to information).
↳ svfg	string	The sparse value flow graph (data dependencies considering points-to info).
↳ icfg	string	The interprocedural control flow graph (required).
↳ ptacg	string	The call graph (includes points-to information, required for icfg).
options	dict	Additonal options for SVF¹.

¹) The number of threads for the parallel execution of the points to analysis may be provided here, using the versioning_threads key and supplying the number of threads an integer value. Moreover, the preciseness of the points-to analysis may be set here as well, through the key: precise_analysis. The value should be set to precise (recommended). Omitting this key altogether will result in the imprecise Andersen's points-to analysis to be used.

Example

svf:
  graphs:
    - tldg
    - icfg
    - svfg
    - ptacg
  options:
    versioning_threads: 7
    precise_analysis: true
  binary: entrypoint
  delete: true

neo4j

key	type	usage
delete	bool	Removes Neo4j docker container after execution.
memory	dict	Memory settings for the Neo4j database.
↳ auto	dict	Neo4j will automatically determine the best configuration.
↳ heap.initial_size²	dict	Initial heap size in MB (m) or GB (g).
↳ heap.max_size²	dict	Maximum heap space in MB (m) or GB (g).
↳ pagecache.size²	dict	Available page cache size in MB (m) or GB (g).
↳ off_heap.max_size²	dict	Maximum off heap size in MB (m) or GB (g).
prepare	string	Preparation query to be executed before the actual query
query	string	Query used to extract proogram data from the neo4j DB.

²) Must be prefixed with dbms.memory., please also refer to the official Neo4j documentation for more information: Link.

Example

neo4j:
  delete: true
  memory:
    dbms.memory.heap.initial_size: 5100m
    dbms.memory.heap.max_size: 5100m
    dbms.memory.pagecache.size: 2900m
    dbms.memory.off_heap.max_size: 4000m
prepare: |
         // INIT DATABASE
         CALL mlsast.prepare.generateLabels();
         CALL mlsast.prepare.generateIndeces();
         CALL mlsast.prepare.connectGraphs();
         CALL mlsast.prepare.makeRelationships();
         CALL mlsast.prepare.setIcfgProps();
         CALL mlsast.prepare.buildIcfgPhis();
query: |
       // Get list of procedures in program.
       CALL mlsast.util.listProcedures()
       YIELD str AS f

       // Find call sites for procedures.
       WITH f
       MATCH (n:FunEntryICFGNode {func_name: f})

       // Find exit nodes for same procedures.
       WITH f, n
       MATCH (m:FunExitICFGNode {func_name: f})

       // Find unique paths between call sites and exit points.
       WITH f, n, collect(m) AS exits
         CALL apoc.path.expandConfig(n, {
             relationshipFilter: "icfg>",
             terminatorNodes: exits,
             uniqueness: "NODE_LEVEL",
             maxLevel: 75
         })

         YIELD path

       // Return name of the procedure (f) and the associated paths (path).
       RETURN f, path;

distance

key	type	usage
model_path	string	The model to be used for the distance analysis.
node_property	string	Data used for the embedding (node_type or ir_opcode).
threshold_factor	int	Factor by which to alter the detection threshold.
centroids	string	Paths on which the models should be initialized (good or bad).
clustering_type	string	Clustering algorithm used to generate the centroids (kmeans).
expected_clusters	int	Expected clusters for each defect class (recommended: 8).
threshold_scaling	int	Factor to scale the detection threshold.
include_models³ ⁴	list	Defect classes that should be included in the analysis.
exclude_models⁴	list	Defect classes by CWE that should be excluded in the analysis.
source_code_delta	int	Reported lines before and after defective code lines.

³) Set to all if all models should be included.

⁴) For the supplied model that was generated on the basis of the Juliet test suite, this would be the corresponding CWE-Class, e.g., CWE-690

Example

distance:
  model_path: models/juliet.zip
  node_property: node_type
  threshold_factor: 1
  centroids: bad  
  clustering_type: kmeans
  expected_clusters: 8
  threshold_scaling: 1
  include_models: all
    - CWE-122
    - CWE-690
  exclude_models:
    - CWE-606
  source_code_delta: 4

main-function

To buid the interprocedural control flow graph (ICFG) of a program, the tool requires a single entrypoint into the program. Normally, the main procedure of any program serves as such entrypoint. However, in some cases there is no main procedure available, e.g., due to the software being a library. In this case, it is necessary to manually devise a main procedure that calls every other procedure that should be respected when generating the ICFG. It is not necessary that the resulting program is executable as long as it is compilable. This means especially that bogus parameters may be passed to the library procedure calls, as long as they respect the procedures signature (e.g., null pointers instead of valid pointers).

Run ml-sast prototype

python3 -m mlsast -p example/

Running the ml-sast prototype for the first time, takes some extra time, as some docker images need to be downloaded and others need to be build.

Visualize results

To visualize the results of the ml-sast prototype a small frontend has been implemented. This can be started by issuing the following command:

cd frontend
python3 -m frontend --report ../example/report.json --models ../models/juliet.zip

Optional options:

The frontend can also be started with two optional options, to filter the detected paths. With the first option ("-x") the reports can be filtered to match a defined regular expressions. With the help of the second option ("-b") the results for the specified models are excluded. Using these two options would result in the following command:

python3 -m frontend --report ../example/report.json --models ../models/juliet.zip -x "^CWE\d{2,3}_.*_\d+" -b CWE-426

Starting the frontend with these options, only loads paths in which the function name of the first node matches the specified regex. The second option removes results concerning the specified model.

Oracle Tests

As described in the study, the prototype was evaluated on real-world software defects. The software defects were extracted by Fan et al. and published in the study. The commit-hashes that fix these software defects are included with the prototype along with several scripts to download the software projects in the version prior to these fixes. These scripts are located in the directory 'evaluation'.

To download the software defects run the following commands:

cd evaluation
bash build.sh && bash build_miniupnp_special_cases.sh && bash build_openjpeg_special_cases.sh

After downloading the software projects in the various versions you can evaluate the prototype on each defect by running the command:

cd evaluation
bash run_prototype_on_oracle_tests.sh

When the prototype has analyzed each version of the different software projects, you can filter the reports using the python script 'evaluate-oracle-tests.py':

cd evaluation
python3 evaluate-oracle-tests.py

After running this script only reports that match the defects in filename and line numbers are kept and stored in each directory under the name 'filtered_report.json'. These filtered reports can then be visualized with the frontend, which comes with the prototype, and manually checked if the prototype correctly detected the defects. To do this change into the directory frontend and issue for example the following command:

cd frontend
python3 -m frontend --report ../evaluation/results/lib/miniupnp/79cca974a4c2ab1199786732a67ff6d898051b78/1.0/filtered_report.json --models ../models/juliet.zip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Usage.md

Usage.md

Usage

Project directory

Code - Directory

build.sh

config.yaml

general

Example

svf

Example

neo4j

Example

distance

Example

main-function

Run ml-sast prototype

Visualize results

Optional options:

Oracle Tests

Files

Usage.md

Latest commit

History

Usage.md

File metadata and controls

Usage

Project directory

Code - Directory

build.sh

config.yaml

general

Example

svf

Example

neo4j

Example

distance

Example

main-function

Run ml-sast prototype

Visualize results

Optional options:

Oracle Tests