Anomaly Detection in Log Data: A Comparative Study - Reproducing the results
In this report I document my attempt to reproduce the results of this paper.
Introduction
Following the examination and interpretation of the publication, I collected the results, measurements and main conclusions of the paper so that I could set up test cases aimed at reproducing them. I encountered several techniques and definitions I wasn't familiar with, so before conducting the measurements, I looked into them.
Goals
- Benchmark popular models on the unified pipeline, on different datasets, with shuffling and a 0.5 training ratio. Compare the median F1 score of the test splits, and its interquartile range (IQR), with the results in the paper.
- Reproduce differences in model performance based on different preprocessing strategies, such as parsing the HDFS dataset, where template explosion can happen because log lines with a dynamic number of variables create different templates even when the semantics behind the log lines are the same. This has been remedied in the LogHub version of the HDFS dataset, which was also used in the paper. BGL logs can also be aggregated and grouped by component, by time or line limits, or both. Each strategy has its pros and cons.
- Shuffled cross-validation is overly optimistic. By shuffling, future logs are mixed with past logs, so the model might learn patterns from "future" data that would not be available in a real-world scenario. The paper highlights the importance of sequential cross-validation, which preserves the order of sequences and uses earlier data for training and later data for testing. Reproduce the differences between shuffled and sequential cross-validation metrics (see the sketch after this list).
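To make the last point concrete, here is a minimal sketch of the two splitting modes, assuming the sequences are ordered by time. This is my own illustration, not the framework's code.

```python
# Minimal sketch of shuffled vs. sequential splitting on time-ordered sequences.
# Illustration only; the framework's actual splitting logic may differ.
import numpy as np

def split(n_sequences, train_ratio=0.5, shuffle=False, seed=0):
    order = np.arange(n_sequences)                 # sequences assumed sorted by time
    if shuffle:
        # Shuffling mixes past and future sequences before the cut.
        order = np.random.default_rng(seed).permutation(n_sequences)
    cut = int(train_ratio * n_sequences)
    return order[:cut], order[cut:]                # (train indices, test indices)

train_seq, test_seq = split(10, shuffle=False)     # train = oldest half, test = newest half
train_shf, test_shf = split(10, shuffle=True)      # "future" sequences can end up in train
```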
Benchmarking popular models
One of the products of the paper was a unified pipeline for anomaly detection in logs, built with modularity in mind, so that it is easier to evaluate and benchmark models, parsers and representations on different datasets. The pipeline enables a fair comparison of popular models by fixing the sampling strategy, using identical sequence definitions and tuning hyperparameters. The fact that the authors published the whole framework on GitHub also made it straightforward for me to reproduce it. For example, the command `python orchestrator.py HDFSLogHub SemPCA --shuffle --train_ratio 0.1` downloads and parses the HDFSLogHub dataset, aggregates the templates, encodes the features, and trains and evaluates the SemPCA model with shuffled sequences and 10% of them used for training. An example of the output for a benchmark:
| Offset | Split | Precision | Recall | F1 |
|---|---|---|---|---|
| 0.0 | train | 0.9397 | 0.9736 | 0.9564 |
| 0.0 | val | 0.9472 | 0.9751 | 0.9610 |
| 0.0 | test | 0.9493 | 0.9717 | 0.9604 |
| 0.1 | train | 0.9359 | 0.7941 | 0.8592 |
| 0.1 | val | 0.9379 | 0.7973 | 0.8619 |
| 0.1 | test | 0.9376 | 0.7976 | 0.8620 |
| 0.2 | train | 0.9483 | 0.9678 | 0.9580 |
| 0.2 | val | 0.9527 | 0.9766 | 0.9645 |
| 0.2 | test | 0.9475 | 0.9725 | 0.9598 |
| 0.3 | train | 0.9527 | 0.9766 | 0.9645 |
| 0.3 | val | 0.9517 | 0.9731 | 0.9623 |
| 0.3 | test | 0.9471 | 0.9718 | 0.9593 |
| 0.4 | train | 0.9517 | 0.9731 | 0.9623 |
| 0.4 | val | 0.9402 | 0.9745 | 0.9570 |
| 0.4 | test | 0.9487 | 0.9721 | 0.9603 |
| 0.5 | train | 0.9375 | 0.9307 | 0.9341 |
| 0.5 | val | 0.9510 | 0.9204 | 0.9354 |
| 0.5 | test | 0.9461 | 0.9291 | 0.9376 |
| 0.6 | train | 0.9533 | 0.9689 | 0.9611 |
| 0.6 | val | 0.9461 | 0.9700 | 0.9579 |
| 0.6 | test | 0.9477 | 0.9731 | 0.9602 |
| 0.7 | train | 0.9461 | 0.9700 | 0.9579 |
| 0.7 | val | 0.9504 | 0.9709 | 0.9606 |
| 0.7 | test | 0.9481 | 0.9729 | 0.9603 |
| 0.8 | train | 0.9504 | 0.9709 | 0.9606 |
| 0.8 | val | 0.9513 | 0.9733 | 0.9622 |
| 0.8 | test | 0.9474 | 0.9725 | 0.9598 |
| 0.9 | train | 0.9488 | 0.7816 | 0.8571 |
| 0.9 | val | 0.9327 | 0.7731 | 0.8454 |
| 0.9 | test | 0.9448 | 0.7749 | 0.8514 |
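As I read it, the Offset column is the fold offset: each fold takes contiguous train/validation/test windows that start `offset * n` sequences into the list. The following is only my reconstruction of that layout, under the assumption that the windows wrap around the end of the sequence list; the framework's implementation may differ.

```python
# My reconstruction of the fold layout (a sketch under my own assumptions,
# not the framework's code): contiguous windows starting at offset * n,
# assumed to wrap around the end of the sequence list.
def fold_indices(n, offset, train_ratio=0.1, val_ratio=0.1):
    rotated = [(int(offset * n) + i) % n for i in range(n)]
    n_train, n_val = int(train_ratio * n), int(val_ratio * n)
    train = rotated[:n_train]
    val = rotated[n_train:n_train + n_val]
    test = rotated[n_train + n_val:]               # the remainder
    return train, val, test

train, val, test = fold_indices(n=100, offset=0.3)
print(len(train), len(val), len(test))             # 10 10 80
```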
This is in line with the method presented in the paper, where each fold is split into a contiguous train split (10% of the sequences, because of the parameter above), a validation split (a constant 10% in the pipeline) and a test split (the remainder). Across folds, these windows shift with the offset, like a sliding window. I made a small script to calculate the median F1 score and the IQR from the output metrics:
```python
import sys

import pandas as pd


def analyze_results(file_path):
    # Load the per-fold metrics exported by the orchestrator.
    try:
        df = pd.read_csv(file_path)
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return
    except Exception as e:
        print(f"Error reading file: {e}")
        return
    if 'split' not in df.columns or 'f1' not in df.columns:
        print("Error: The CSV must contain 'split' and 'f1' columns.")
        return
    # Only the test splits are compared against the paper.
    test_results = df[df['split'] == 'test'].copy()
    if test_results.empty:
        print("Error: No rows with split='test' found.")
        return
    f1_scores = test_results['f1']
    # Median F1 and interquartile range across the folds.
    median_f1 = f1_scores.median()
    iqr = f1_scores.quantile(0.75) - f1_scores.quantile(0.25)
    print(f"{median_f1:.4f} {iqr:.4f}")


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python median.py <path_to_csv_file>")
    else:
        analyze_results(sys.argv[1])
```
This script gives us the metrics that were presented in the paper for a given dataset and model combination.
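As a quick sanity check of the median/IQR computation, with made-up F1 values that are not taken from any run:

```python
# Toy check of the median/IQR logic used in median.py (values are made up).
import pandas as pd

f1 = pd.Series([0.86, 0.95, 0.96, 0.96, 0.96])
print(f"{f1.median():.4f} {(f1.quantile(0.75) - f1.quantile(0.25)):.4f}")  # 0.9600 0.0100
```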
For the general benchmarking of popular models, the paper used shuffled datasets with a 50% training ratio. I did not want to train the models on my own personal computer; instead I utilized two unused Fujitsu servers sitting in the rack of the Kollégiumi Számítástechnikai Kör (KSZK). Each server also had an NVIDIA Quadro P620 GPU, and I hoped these might come in handy for training the neural network models. Sadly, in my experience the framework somehow does not utilize the GPUs, even in the presence of correct CUDA and NVIDIA drivers.
To make this work, I put together a small training script that merges the training and the export of the needed metrics:
```bash
#!/bin/bash
dataset=$1
model=$2

cd ~/LogADComp && source venv/bin/activate

# Train and evaluate; the framework's own console output is discarded here.
python orchestrator.py "$dataset" "$model" --shuffle --train_ratio 0.5 > /dev/null 2>&1

# Summarize the exported per-fold metrics into "median IQR".
metric_file="outputs/${dataset}Shuffled_0.5/${model}/metrics.csv"
if [ -f "$metric_file" ]; then
    f1_score=$(python median.py "$metric_file")
else
    f1_score="ERROR"
fi

echo "$dataset,$model,$f1_score"
```
I also had to modify the provided orchestrator.py so that it saves the metrics to a CSV file.
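For reference, a hypothetical sketch of the kind of export I added; the actual modification differs, but the point is that every offset/split pair ends up as a CSV row with at least the 'split' and 'f1' columns that median.py expects.

```python
# Hypothetical sketch of a metrics export for orchestrator.py; names and
# paths are illustrative, not the framework's actual code.
import os

import pandas as pd

def save_metrics(rows, out_dir):
    # rows: list of dicts, e.g. {'offset': 0.1, 'split': 'test',
    #                            'precision': 0.94, 'recall': 0.97, 'f1': 0.96}
    os.makedirs(out_dir, exist_ok=True)
    pd.DataFrame(rows).to_csv(os.path.join(out_dir, "metrics.csv"), index=False)
```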
As a last step, to distribute the training of the models between my two nodes, I used the GNU parallel program:
```bash
parallel --colsep ' ' \
    --sshloginfile nodes \
    --jobs 2 \
    "./LogADComp/train.sh {1} {2}" \
    :::: args \
    | tee master_result.csv
```
Parallel uses SSH to reach the given nodes, listed in the file passed to --sshloginfile, and runs the command built from the command template and the file containing the arguments. On each node, two models are trained in parallel, and the standard output from each node is piped back and aggregated into a file on my machine.
The raw output metrics of my benchmarks, arranged into a table similar to the one in the paper:
| Method | HDFS | BGL | Thunderbird |
|---|---|---|---|
| NeuralLog | 0.9890 (0.0055) | 0.9800 (0.0165) | 0.9982 (0.0003) |
| LogRobust | 0.9959 (0.0021) | 0.9400 (0.0225) | 0.9994 (0.0000) |
| SVM | 0.9768 (0.0015) | 0.9167 (0.0027) | 0.9990 (0.0001) |
| LogBERT | 0.7333 (0.0131) | 0.7904 (0.0092) | 0.9682 (0.0006) |
| LogAnomaly | 0.9117 (0.0238) | 0.7739 (0.0160) | 0.9476 (0.0011) |
| DeepLog | 0.8900 (0.0854) | 0.7690 (0.0111) | 0.9466 (0.0007) |
| LogCluster | 0.9314 (0.0037) | 0.7614 (0.0027) | 0.4478 (0.1473) |
| SemPCA | 0.9443 (0.0152) | 0.4485 (0.0060) | 0.4845 (0.0015) |
| PCA | 0.8104 (0.0252) | 0.4396 (0.0034) | 0.3791 (0.0022) |
Median F1 Score (IQR) for Each Anomaly-Detection Method Under 10-Fold Shuffled Cross-Validation on HDFS, BGL, and Thunderbird.
The result: after rounding the values to three decimals, I get exactly the same measurements as the paper. Overall, the same conclusions can be drawn: supervised models did best. On the HDFS dataset LogRobust was the most performant, on BGL it was NeuralLog, and on Thunderbird LogRobust and SVM were essentially tied.
Differences in model performance, based on different preprocessing strategies
HDFS
As I understood it, the original HDFS dataset contained logs with a dynamic number of variables, and Drain, upon parsing them, created a different template for each variable count. For example, the log `Deleting block blk1 blk2` would create the template `Deleting block * *`, while `Deleting block blk1 blk2 blk3` would create `Deleting block * * *`, a different template, erasing the fact that both lines describe the same operation.
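A toy illustration of the problem follows. This is not Drain itself, and the exact preprocessing in the fixed/LogHub variants differs; it only shows why masking each variable separately multiplies the templates.

```python
# Toy illustration of template explosion: masking each block id separately
# keeps the *number* of variables in the template, so semantically identical
# events produce different templates. Not Drain, just an illustration.
import re

logs = [
    "Deleting block blk1 blk2",
    "Deleting block blk1 blk2 blk3",
]

naive = {re.sub(r"blk\d+", "*", line) for line in logs}
print(sorted(naive))      # ['Deleting block * *', 'Deleting block * * *']

# Collapsing the whole variable list into a single placeholder (roughly what
# the cleaned-up dataset variants achieve) keeps one template for both lines.
collapsed = {re.sub(r"(blk\d+\s*)+", "<*>", line).strip() for line in logs}
print(sorted(collapsed))  # ['Deleting block <*>']
```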
The original HDFS dataset, whose URL was hardcoded in the provided framework, is unfortunately no longer online. However, the website hosting the dataset was archived by the Wayback Machine, so I managed to get hold of it. This did not prove to be enough: for a reason I could not figure out, Drain could not parse this dataset and did not extract any templates.
Nonetheless, I still ran the benchmark on the HDFSFixed and HDFSLogHub datasets, in which this template explosion does not happen thanks to preprocessing and filtering. The importance of this, however, cannot be seen here, because that would have required running the models on the original, unfiltered dataset as well.
The results:
| Method | HDFS Fixed | HDFS LogHub |
|---|---|---|
| NeuralLog | 0.9712 (0.0144) | 0.9890 (0.0055) |
| LogRobust | 0.9913 (0.0035) | 0.9959 (0.0021) |
| SVM | 0.9859 (0.0050) | 0.9768 (0.0015) |
| LogBERT | 0.7089 (0.0297) | 0.7333 (0.0131) |
| LogAnomaly | 0.8920 (0.0284) | 0.9117 (0.0238) |
| DeepLog | 0.8284 (0.0348) | 0.8900 (0.0854) |
| LogCluster | 0.9374 (0.0053) | 0.9314 (0.0037) |
| SemPCA | 0.9372 (0.0025) | 0.9443 (0.0152) |
| PCA | 0.7239 (0.0248) | 0.8104 (0.0252) |
Median F1 Score (IQR) for Each Anomaly-Detection Method Under 10-Fold Shuffled Cross-Validation on HDFS with Various Preprocessing Options.
BGL grouping strategies
Here I show my results of benchmarking models on different grouping strategies on the BGL dataset.
| Method | Component & 120 lines | 120 lines & 60 s | 40 lines & 60 s |
|---|---|---|---|
| NeuralLog | 0.9859 (0.0029) | 0.8630 (0.0121) | 0.9800 (0.0165) |
| LogRobust | 0.9982 (0.0029) | 0.9301 (0.0171) | 0.9400 (0.0225) |
| SVM | 0.9996 (0.0003) | 0.9244 (0.0060) | 0.9167 (0.0027) |
| LogBERT | 0.9831 (0.0010) | 0.8341 (0.0101) | 0.7904 (0.0092) |
| LogAnomaly | 0.8270 (0.0239) | 0.7809 (0.0205) | 0.7739 (0.0160) |
| DeepLog | 0.8335 (0.0371) | 0.7292 (0.0468) | 0.7690 (0.0111) |
| LogCluster | 0.9459 (0.0025) | 0.7893 (0.0054) | 0.7614 (0.0027) |
| SemPCA | 0.6742 (0.0011) | 0.2937 (0.1415) | 0.4485 (0.0060) |
| PCA | 0.5990 (0.0037) | 0.4330 (0.0044) | 0.4396 (0.0034) |
Median F1 Score (IQR) for Each Method Under 10-Fold Shuffled Cross-Validation on BGL with Various Grouping Options.
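To illustrate what I understand a line-and-time based grouping such as "40 lines & 60 s" to mean, here is a small sketch. My assumption is that a window closes when either the line limit or the time limit is reached; the framework's exact windowing may differ.

```python
# Toy sketch of grouping BGL log lines into windows by a line limit and a
# time limit (my assumption about "40 lines & 60 s", not the framework's code).
def group_logs(entries, max_lines=40, max_seconds=60):
    # entries: list of (timestamp_in_seconds, log_line) tuples in log order
    windows, current, start_ts = [], [], None
    for ts, line in entries:
        if not current:
            start_ts = ts
        current.append(line)
        if len(current) >= max_lines or ts - start_ts >= max_seconds:
            windows.append(current)
            current = []
    if current:
        windows.append(current)
    # Each resulting window is then treated as one sequence by the detectors.
    return windows
```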
Influence of Data-splitting Procedures
In this section I reproduce the differences between shuffled and sequential splitting with different training ratios.
| Method | 50% Ratio (Shuffled) | 50% Ratio (Sequential) | 1% Ratio (Shuffled) | 1% Ratio (Sequential) |
|---|---|---|---|---|
| NeuralLog | 0.9890 (0.0055) | 0.9173 (0.1119) | 0.0000 (0.0000) | 0.0000 (0.3803) |
| LogRobust | 0.9959 (0.0021) | 0.9541 (0.0612) | 0.9584 (0.0184) | 0.5168 (0.2955) |
| SVM | 0.9768 (0.0015) | 0.9315 (0.0709) | 0.9445 (0.0333) | 0.7876 (0.1723) |
| LogBERT | 0.7333 (0.0131) | 0.6100 (0.4167) | 0.6028 (0.0699) | 0.1351 (0.2413) |
| LogAnomaly | 0.9117 (0.0238) | 0.7062 (0.4752) | 0.9325 (0.0091) | 0.1903 (0.5681) |
| DeepLog | 0.8900 (0.0854) | 0.5773 (0.5575) | 0.7752 (0.0860) | 0.4206 (0.6893) |
| LogCluster | 0.9314 (0.0037) | 0.9019 (0.6017) | 0.9312 (0.0105) | 0.1414 (0.0565) |
| SemPCA | 0.9443 (0.0152) | 0.6930 (0.0899) | 0.9434 (0.0006) | 0.0994 (0.0326) |
| PCA | 0.8104 (0.0252) | 0.7953 (0.1801) | 0.7903 (0.0075) | 0.1473 (0.3558) |
Median F1 Score (IQR) for HDFS. (50% vs 1% Training Ratios, Shuffled vs Sequential Splits).
| Method | BGL 50% (Shuffled) | BGL 50% (Sequential) | Thunderbird 50% (Shuffled) | Thunderbird 50% (Sequential) |
|---|---|---|---|---|
| NeuralLog | 0.9800 (0.0165) | 0.4895 (0.6898) | 0.9982 (0.0003) | 0.9928 (0.0059) |
| LogRobust | 0.9400 (0.0225) | 0.1887 (0.3225) | 0.9994 (0.0000) | 0.9931 (0.0069) |
| SVM | 0.9167 (0.0027) | 0.5698 (0.1114) | 0.9990 (0.0001) | 0.9971 (0.0047) |
| LogBERT | 0.7904 (0.0092) | 0.2570 (0.3279) | 0.9682 (0.0006) | 0.8604 (0.1393) |
| LogAnomaly | 0.7739 (0.0160) | 0.3799 (0.2901) | 0.9476 (0.0011) | 0.7773 (0.1577) |
| DeepLog | 0.7690 (0.0111) | 0.3500 (0.3438) | 0.9466 (0.0007) | 0.8081 (0.1200) |
| LogCluster | 0.7614 (0.0027) | 0.4202 (0.3051) | 0.4478 (0.1473) | 0.4723 (0.1907) |
| SemPCA | 0.4485 (0.0060) | 0.1764 (0.0612) | 0.4845 (0.0015) | 0.2145 (0.2783) |
| PCA | 0.4396 (0.0034) | 0.2788 (0.1835) | 0.3791 (0.0022) | 0.5219 (0.2354) |
Median F1 Score (IQR) for BGL, Thunderbird. (50% Shuffled vs 50% Sequential Splits).
| Method | HDFS (Shuffled, 10%) | HDFS (Sequential, 10%) | BGL (Shuffled, 10%) | BGL (Shuffled, 1%) |
|---|---|---|---|---|
| NeuralLog | 0.9800 (0.0165) | 0.4895 (0.6898) | 0.9982 (0.0003) | 0.9928 (0.0059) |
| LogRobust | 0.9400 (0.0225) | 0.1887 (0.3225) | 0.9994 (0.0000) | 0.9931 (0.0069) |
| SVM | 0.9167 (0.0027) | 0.5698 (0.1114) | 0.9990 (0.0001) | 0.9971 (0.0047) |
| LogBERT | 0.7904 (0.0092) | 0.2570 (0.3279) | 0.9682 (0.0006) | 0.8604 (0.1393) |
| LogAnomaly | 0.7739 (0.0160) | 0.3799 (0.2901) | 0.9476 (0.0011) | 0.7773 (0.1577) |
| DeepLog | 0.7690 (0.0111) | 0.3500 (0.3438) | 0.9466 (0.0007) | 0.8081 (0.1200) |
| LogCluster | 0.7614 (0.0027) | 0.4202 (0.3051) | 0.4478 (0.1473) | 0.4723 (0.1907) |
| SemPCA | 0.4485 (0.0060) | 0.1764 (0.0612) | 0.4845 (0.0015) | 0.2145 (0.2783) |
| PCA | 0.4396 (0.0034) | 0.2788 (0.1835) | 0.3791 (0.0022) | 0.5219 (0.2354) |
Median F1 Score (IQR) at 10% Training Ratio for HDFS and BGL Under Shuffled and Sequential Splits.
I omitted the measurements of the sequential splits on BGL at lower training ratios, because the paper states that performance was already below an acceptable level at 50% sequential splits.
Conclusion
My results validate the paper’s conclusion that transformer models excel on shuffled data but struggle with sequential logs, while classical methods (e.g., SVM) remain consistent. Crucially, the benchmarks also confirm that minor preprocessing choices, particularly sequence grouping, drastically alter outcomes.