Commit 05fb849
Update the protocol inference test infra with Mongo changes (#1758)
Summary: Previously, the TShark command in the `dataset_generation`
script could not decode Mongo pcap files and insert them into the
dataset for evaluation. This PR adds a flag to the TShark command to
decode traffic running through port 27017 as Mongo. The readme is also
updated to provide information about the bidirectional connection-level
dataset.
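
As a rough illustration of the kind of flag described above, the sketch
below builds a TShark invocation that forces TCP port 27017 to be decoded
as Mongo. The `-d tcp.port==27017,mongo` decode-as selector is standard
TShark syntax; the helper name `build_tshark_cmd`, the pcap path, and the
extracted fields are assumptions for illustration and are not taken from
the `dataset_generation` script.

```python
from typing import List

MONGO_PORT = 27017  # port named in the summary above


def build_tshark_cmd(pcap_path: str, fields: List[str]) -> List[str]:
    """Build a TShark argv list (hypothetical helper, not the script's code)."""
    cmd = [
        "tshark",
        "-r", pcap_path,                         # read packets from a capture file
        "-d", f"tcp.port=={MONGO_PORT},mongo",   # decode this port as Mongo
        "-T", "fields",                          # emit selected fields only
    ]
    for field in fields:
        cmd += ["-e", field]                     # one -e per extracted field
    return cmd


if __name__ == "__main__":
    # Assumed file name and field list, shown only to make the sketch runnable.
    print(" ".join(build_tshark_cmd("mongo.pcap", ["frame.protocols", "tcp.stream"])))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids
shell quoting issues around the `==` in the decode-as selector.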
**Updates to the confusion matrix**
In the previous image, the connections per protocol in the dataset appear
to have been duplicated, leading to an inflated number of connections per
protocol. This may have been due to the `dataset_generation` script
appending data to the `.tsv` files each time it was run, even though the
underlying pcap file content/counts had not changed.
Running the `dataset_generation` script with empty `.tsv` files and the
same pcap files, followed by the `eval` script, resulted in a matrix
showing far fewer connections per protocol, suggesting that there may
have been duplication in the dataset previously.
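
The suspected duplication mechanism can be reproduced with a small
sketch: writing the same rows in append mode on every run multiplies the
row count, while truncating the file first keeps it stable. The row
layout and the helper names (`write_rows`, `count_rows`, `demo`) are
assumptions for illustration, not the script's actual code.

```python
import csv
import os
import tempfile
from typing import Iterable, Tuple

Row = Tuple[str, str]  # (protocol label, payload): an assumed row layout


def write_rows(tsv_path: str, rows: Iterable[Row], append: bool) -> None:
    """Write rows to a TSV file, either appending or truncating first."""
    mode = "a" if append else "w"
    with open(tsv_path, mode, newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)


def count_rows(tsv_path: str) -> int:
    """Count the rows currently stored in a TSV file."""
    with open(tsv_path, newline="") as f:
        return sum(1 for _ in csv.reader(f, delimiter="\t"))


def demo(runs: int = 4) -> Tuple[int, int]:
    """Simulate repeated runs over unchanged input; return (appended, fresh) counts."""
    rows = [("mongo", "aa"), ("tls", "bb")]  # made-up rows for illustration
    with tempfile.TemporaryDirectory() as tmp:
        appended = os.path.join(tmp, "appended.tsv")
        fresh = os.path.join(tmp, "fresh.tsv")
        for _ in range(runs):
            write_rows(appended, rows, append=True)   # grows on every run
            write_rows(fresh, rows, append=False)     # stays at len(rows)
        return count_rows(appended), count_rows(fresh)
```

With two input rows and four runs, the appended file ends up with eight
rows and the truncated file with two, which mirrors the 4x inflation
described in this section.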
The connection counts for each protocol in the older dataset appear to
be 4x or 8x the counts in the new dataset, which would explain why the
inference accuracy remained constant between the old and new matrices.
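
A short sketch of why this holds: if every row of a confusion matrix
(one row per true protocol) is multiplied by a constant, the
per-protocol accuracy, computed as correct predictions divided by the
row total, does not change. The counts below are made up for
illustration and are not taken from the actual matrices; the helper
names are hypothetical.

```python
from typing import Dict

# true label -> predicted label -> connection count
Matrix = Dict[str, Dict[str, int]]


def per_protocol_accuracy(matrix: Matrix) -> Dict[str, float]:
    """Fraction of each true protocol's connections that were labeled correctly."""
    return {
        true: row.get(true, 0) / sum(row.values())
        for true, row in matrix.items()
    }


def scale_rows(matrix: Matrix, factors: Dict[str, int]) -> Matrix:
    """Duplicate every connection of a protocol `factors[protocol]` times."""
    return {
        true: {pred: n * factors[true] for pred, n in row.items()}
        for true, row in matrix.items()
    }


# Illustrative counts only.
new_matrix: Matrix = {
    "mongo": {"mongo": 90, "unknown": 10},
    "tls": {"tls": 50, "unknown": 0},
}
old_matrix = scale_rows(new_matrix, {"mongo": 4, "tls": 8})

# Duplication changes the counts but not the per-protocol accuracy.
assert per_protocol_accuracy(old_matrix) == per_protocol_accuracy(new_matrix)
```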
The TLS connection count dropped in the new matrix by the previous
number of Mongo connections (432) because the new TShark command decodes
Mongo connections. In the old dataset, the Mongo captures may have been
captured in one of the early iterations of running the
`dataset_generation` script and not updated since.
**New mongo additions**
In the old dataset, the Mongo pcap files were mainly of type `OP_QUERY`,
an opcode that Stirling does not currently process. More Mongo pcap
files of type `OP_MSG` were added to test the existing inference rule.
As a result, 0.9% were mislabeled as `unknown` because request-side data
was missing from the connection and the existing rule does not support
response-side inference for `OP_MSG` packets. Another 0.7% were
mislabeled as `pgsql`, again because request-side data was missing from
the connection and the packet's opcode is one that Stirling does not
recognize.
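
For context on the opcodes mentioned above, the sketch below parses the
standard 16-byte Mongo wire-protocol message header (four little-endian
int32 fields) and maps the opcode to a name. The opcode values come from
the MongoDB wire protocol; the helper names are hypothetical and this is
not Stirling's inference rule, only an illustration of where the opcode
sits in a packet.

```python
import struct
from typing import NamedTuple

# messageLength, requestID, responseTo, opCode: all little-endian int32.
HEADER = struct.Struct("<iiii")

# A subset of wire-protocol opcodes.
OPCODES = {
    1: "OP_REPLY",
    2004: "OP_QUERY",
    2012: "OP_COMPRESSED",
    2013: "OP_MSG",
}


class MsgHeader(NamedTuple):
    message_length: int
    request_id: int
    response_to: int
    op_code: int


def parse_header(data: bytes) -> MsgHeader:
    """Parse the fixed-size header at the start of a Mongo message."""
    if len(data) < HEADER.size:
        raise ValueError("need at least 16 bytes for a Mongo message header")
    return MsgHeader(*HEADER.unpack_from(data))


def opcode_name(op_code: int) -> str:
    """Return the opcode's name, or 'unrecognized' if it is not in the table."""
    return OPCODES.get(op_code, "unrecognized")
```

For example, `opcode_name(parse_header(HEADER.pack(16, 7, 0, 2013)).op_code)`
yields `"OP_MSG"`. A nonzero `response_to` marks a message as a reply to an
earlier request, which is the only side available when request data is
missing from a connection.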
Related issues: #640
Type of change: /kind test-infra
Test Plan: Ran the dataset generation and evaluation scripts with the
new TShark flag and verified the `.tsv` files were created appropriately
and the confusion matrix was as expected.
Signed-off-by: Kartik Pattaswamy <kpattaswamy@pixielabs.ai>
1 parent b868d8c commit 05fb849
3 files changed (+10, -2) in src/stirling/protocol_inference