Authors :
Ken Carlo D. Javier; Allyza Maureen P. Catura; Jonathan C. Morano; Mark Christopher R. Blanco
Volume/Issue :
Volume 9 - 2024, Issue 3 - March
Google Scholar :
https://tinyurl.com/yc35yrkh
Scribd :
https://tinyurl.com/3vzu8r84
DOI :
https://doi.org/10.38124/ijisrt/IJISRT24MAR2052
Abstract :
Metamorphic malware poses a significant threat to conventional
signature-based malware detection because its signature is mutable:
a single metamorphic malware can generate many copies, each with a
different signature. Signature-based detection is therefore
impractical and ineffective against it, and research in recent years
has focused on machine learning-based approaches to malware
detection. The Profile Hidden Markov Model (PHMM) is a probabilistic
model that uses multiple sequence alignments and a position-based
scoring system. An enhanced Profile Hidden Markov Model was
constructed with the following modifications: n-gram analysis to
determine the best n-gram length for the dataset, a frequency
threshold to determine which n-gram opcodes are included in malware
detection, and the addition of consensus sequences to the multiple
sequence alignments. The study used 1,000 malware executable files
and 40 benign executable files. Results show that n-gram analysis
and the addition of consensus sequences help increase malware
detection accuracy. Moreover, setting the frequency threshold based
on the average TF-IDF of n-gram opcodes gives higher accuracy in
most malware families than simply selecting the 36 most frequent
n-grams, as done in previous studies.
Keywords :
Consensus Sequence, Metamorphic Malware, N-Gram Analysis, Profile Hidden Markov Model, TF-IDF
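
The frequency-threshold idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact TF-IDF formulation, so the code below assumes the plain form tf × log(N / df), averaged over all samples, and keeps every n-gram whose score is at or above the mean score. The function names (`opcode_ngrams`, `average_tfidf`, `select_by_average_threshold`) and the toy opcode sequences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return the overlapping n-grams of an opcode sequence as tuples."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def average_tfidf(samples, n):
    """Map each n-gram to its TF-IDF score averaged over all samples.

    Assumption: tf is the n-gram count divided by the number of n-grams
    in the sample, and idf is log(N / df). The paper may use a variant.
    """
    docs = [Counter(opcode_ngrams(sample, n)) for sample in samples]
    num_docs = len(docs)
    # Iterating a Counter yields its keys, so each sample counts once per n-gram.
    doc_freq = Counter(gram for doc in docs for gram in doc)
    scores = Counter()
    for doc in docs:
        total = sum(doc.values())
        for gram, count in doc.items():
            tf = count / total
            idf = math.log(num_docs / doc_freq[gram])
            scores[gram] += tf * idf / num_docs
    return dict(scores)


def select_by_average_threshold(scores):
    """Keep the n-grams whose score is at or above the mean score."""
    if not scores:
        return set()
    threshold = sum(scores.values()) / len(scores)
    return {gram for gram, score in scores.items() if score >= threshold}


# Toy opcode sequences standing in for disassembled executables.
samples = [
    ["mov", "push", "call"],
    ["mov", "push", "ret"],
    ["xor", "xor", "xor"],
]
selected = select_by_average_threshold(average_tfidf(samples, n=2))
print(selected)
```

Unlike a fixed top-k cutoff such as the top 36 n-grams, the number of n-grams kept by this threshold varies with the score distribution of each dataset.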