|
The problem of finding anomalies (outliers) in databases is one of the most important issues in modern data analysis. One reason is that this issue occurs in almost every type of database, including numerical, categorical, time, mixed, or graphic data. Many methods currently exist, often dedicated to a specific kind of data analysis. Finally, this topic is extremely interesting in its own right as a research problem. One of the classic methods dedicated to finding anomalies in data is Isolation Forest. However, this method, with a few exceptions, has not been modified since its first publication, and, in particular, it has not yet appeared in combination with the typical fuzzy methods used for grouping, such as Fuzzy C-Means (FCM) clustering. In this study, we thoroughly analyze this approach, as well as several related ones. We examine the possibilities of this technique and analyze it in detail with respect to characteristics of the data (database size, number of attributes, records, their type, etc.). It is worth noting that FCM makes it possible to obtain the membership grades of the elements forming Isolation Forest nodes in the clusters on the basis of which these nodes are built. Hence, at the stage of calculating the anomaly scores, this information is used effectively, in particular to express how much a given element may belong to a group of similar elements, which can be inferred from the characteristics of the cluster in which it lies. In this study, we propose a set of methods enhancing the Isolation Forest on the basis of Fuzzy C-Means. The results of numerical experiments carried out on 27 various datasets and reported in this paper lead us to the conclusion that FCM can play a pivotal role in enhancing the Isolation Forest approach and raises the values of particular measures of effectiveness of anomaly detection methods.
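
As a rough illustration of the two ingredients named above, the following sketch computes the standard FCM membership grades of a point for fixed cluster centers, and the standard Isolation Forest anomaly score from a mean path length. It does not reproduce how this study combines the two. The function names, the fuzzifier default `m=2.0`, and the use of Euclidean distance are assumptions made only for this example.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def fcm_memberships(point, centers, m=2.0):
    """Standard FCM membership grades of one point for fixed centers.

    Assumes Euclidean distance and a fuzzifier m > 1 (illustrative defaults).
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
            for d_j in dists]


def average_path_length(n):
    """Normalizing factor c(n) used by the standard Isolation Forest score."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Standard score 2 ** (-E[h(x)] / c(n)); assumes n >= 2."""
    return 2.0 ** (-mean_path_length / average_path_length(n))
```

In this sketch the membership grades of a point always sum to 1, and a mean path length equal to `average_path_length(n)` yields a score of 0.5, the usual boundary between ordinary and anomalous points.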
|