Commit 05fb849

Update the protocol inference test infra with Mongo changes (#1758)
Summary: Previously, the TShark command in the `dataset_generation` script could not decode Mongo pcap files, so they were not inserted into the dataset for evaluation. This PR adds a flag to the TShark command that decodes traffic on port 27017 as Mongo. The README is also updated to describe the bidirectional connection-level dataset.

**Updates to the confusion matrix**

In the previous image, the connections per protocol appear to have been duplicated, which inflated the connection count for each protocol. This may have happened because the `dataset_generation` script appended data to the `.tsv` files each time it was run, even though the content and counts of the underlying pcap files had not changed. Running the `dataset_generation` script against empty `.tsv` files with the same pcap files, followed by the `eval` script, produced a matrix with far fewer connections per protocol, which suggests that the earlier dataset contained duplicates. The per-protocol connection counts in the older dataset appear to be 4x or 8x the counts in the new dataset, which is consistent with the inference accuracy remaining constant between the old and new matrices.

The TLS connection count dropped in the new matrix by the previous number of Mongo connections (432) because the new TShark command decodes Mongo connections. In the old dataset, the Mongo captures may have been added during one of the early runs of the `dataset_generation` script and not updated since.

**New mongo additions**

In the old dataset, the Mongo pcap files were mainly of type `OP_QUERY`, an opcode that Stirling does not currently process. More Mongo pcap files of type `OP_MSG` were added to test the existing inference rule. As a result, 0.9% were mislabeled as `unknown` because request-side data was missing from the connection and the existing rule does not support response-side inference for `OP_MSG` packets. Another 0.7% were mislabeled as `pgsql`, again because request-side data was missing from the connection and the packet's opcode is one that Stirling does not recognize.

Related issues: #640

Type of change: /kind test-infra

Test Plan: Ran the dataset generation and evaluation scripts with the new TShark flag and verified that the `.tsv` files were created appropriately and that the confusion matrix was as expected.

Signed-off-by: Kartik Pattaswamy <kpattaswamy@pixielabs.ai>
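The duplication hypothesis in the summary can be illustrated with a small sketch. It assumes, purely for illustration, that each run writes the same rows to a `.tsv` file; the helper names (`write_rows`, `count_rows`, `rows_after_runs`) and the sample rows are hypothetical and are not taken from the `dataset_generation` script, whose file handling is not shown in this commit.

```python
import csv
import os
import tempfile


def write_rows(path, rows, mode):
    """Write rows as TSV; mode "a" appends, mode "w" truncates first."""
    with open(path, mode, newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)


def count_rows(path):
    """Count the TSV rows currently stored in the file."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f, delimiter="\t"))


def rows_after_runs(mode, runs, rows):
    """Simulate repeated runs over unchanged input and return the final row count."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "conn_dataset.tsv")
        for _ in range(runs):
            write_rows(path, rows, mode)
        return count_rows(path)


# Hypothetical sample rows: (protocol label, connection id).
SAMPLE_ROWS = [("mongo", "conn-1"), ("tls", "conn-2"), ("pgsql", "conn-3")]

appended = rows_after_runs("a", 4, SAMPLE_ROWS)   # grows with every run
truncated = rows_after_runs("w", 4, SAMPLE_ROWS)  # stays at one copy
```

Under these assumptions, four runs in append mode leave four copies of every row, while truncating before each run keeps the count stable. Because every protocol is inflated by the same factor, per-protocol accuracy ratios would not change, which matches the observation in the summary.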
1 parent b868d8c commit 05fb849

3 files changed

+10
-2
lines changed

src/stirling/protocol_inference/README.md

Lines changed: 8 additions & 1 deletion
    @@ -62,9 +62,16 @@ which is defined uniquely by `src_addr`, `dst_addr`, `src_port`, and `dst_port`.
     a series of packets in a connection. The goal is to evaluate if a connection is eventually correctly
     classified over a period over time.
     
    +#### bidirectional-connection-level dataset
    +
    +One row in the bidirectional-connection-level dataset contains a series of packets over time in a bidrectional connection.
    +Packets on both directions of a connection are merged by their `src_addr`, `dst_addr`, `src_port`, and `dst_port` and grouped to
    +make the direction agnostic. This enables protocol inference on a series of packets in a bidirectional connection. The goal is
    +to evaluate if at least one side of a connection can be classified to infer the protocol of the entire bidirectional connection.
    +
     ## Protocol Inference Eval
     
    -There should be two tsv files `packet_dataset.tsv` and `conn_dataset.tsv` in the dataset folder.
    +There should be three tsv files `packet_dataset.tsv`, `conn_dataset.tsv` and `bi_dir_conn_dataset.tsv` in the dataset folder.
     Right now, available models are {ruleset_basic, ruleset_basic_conn}.
     ```shell script
     bazel run src/stirling/protocol_inference:eval -- --dataset <packet_dataset.tsv> --num_workers 8
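
The direction-agnostic grouping described in the added README text can be sketched roughly as follows. This is a minimal illustration, not the script's implementation: the function names (`bidir_key`, `group_bidirectional`) and the dictionary layout of a packet are assumptions made for this example.

```python
from collections import defaultdict


def bidir_key(src_addr, src_port, dst_addr, dst_port):
    """Return one key for both directions of a connection.

    The two endpoints are ordered, so swapping source and destination
    yields the same key.
    """
    a = (src_addr, src_port)
    b = (dst_addr, dst_port)
    return (a, b) if a <= b else (b, a)


def group_bidirectional(packets):
    """Group packets (hypothetical dicts) by their direction-agnostic key."""
    groups = defaultdict(list)
    for pkt in packets:
        key = bidir_key(
            pkt["src_addr"], pkt["src_port"], pkt["dst_addr"], pkt["dst_port"]
        )
        groups[key].append(pkt)
    return dict(groups)


# Hypothetical request/response pair on a Mongo port.
PACKETS = [
    {"src_addr": "10.0.0.1", "src_port": 5000,
     "dst_addr": "10.0.0.2", "dst_port": 27017, "payload": "request"},
    {"src_addr": "10.0.0.2", "src_port": 27017,
     "dst_addr": "10.0.0.1", "dst_port": 5000, "payload": "response"},
]

grouped = group_bidirectional(PACKETS)
```

With both directions under one key, a classifier only needs one side of the conversation to be recognizable in order to label the whole connection, which is the goal stated in the README text.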
Binary file changed: -66.3 KB

src/stirling/protocol_inference/dataset_generation.py

Lines changed: 2 additions & 1 deletion
    @@ -86,7 +86,8 @@ def gen_tshark_cmd():
         -e tcp.srcport \
         -e udp.srcport \
         -e tcp.dstport \
    -    -e udp.dstport"
    +    -e udp.dstport \
    +    -d tcp.port==27017,mongo"
     
         for protocol, spec in ProtocolParsingSpecs.items():
             field_name = spec["length_field"]
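
To show how the new "decode as" flag fits into a TShark invocation, here is a hedged sketch that assembles the command as an argument list. The helper `build_tshark_cmd` and its parameters are hypothetical and do not mirror `gen_tshark_cmd`; only the TShark options `-r`, `-T fields`, `-e`, and `-d` are real.

```python
def build_tshark_cmd(pcap_file, fields, decode_as=None):
    """Build a tshark argument list that extracts fields from a pcap file.

    decode_as holds rules in tshark's `-d` syntax, for example
    "tcp.port==27017,mongo" to dissect traffic on TCP port 27017 as Mongo.
    """
    cmd = ["tshark", "-r", pcap_file, "-T", "fields"]
    for field in fields:
        cmd += ["-e", field]
    for rule in decode_as or []:
        cmd += ["-d", rule]
    return cmd


# Hypothetical pcap name; the field names come from the diff above.
mongo_cmd = build_tshark_cmd(
    "capture.pcap",
    ["tcp.srcport", "udp.srcport", "tcp.dstport", "udp.dstport"],
    decode_as=["tcp.port==27017,mongo"],
)
```

The list could be passed to `subprocess.run(mongo_cmd, capture_output=True, text=True)` on a machine where TShark is installed. Building a list rather than a single string avoids shell quoting issues, though the original script uses a formatted string.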
