Identification of groups of visited Internet resources for detecting sources of internal cyber threats

Sergey V. Isaev, Denis Y. Doncov

Institute of Computational Modeling SB RAS

Protecting the corporate network is essential to the successful functioning of an organization. This paper studies the cybersecurity of the internal network perimeter using the example of the Krasnoyarsk Scientific Center of the Siberian Branch of the Russian Academy of Sciences. Various tools exist for preventing cyber threats and analyzing visited Internet resources, but their performance and applicability depend strongly on the amount of input data. The article reviews existing methods for identifying network threats by analyzing proxy server logs and investigates the division of Internet users into thematic groups to detect anomalies. A method for clustering Internet resources is proposed that reduces the volume of input data by excluding groups of safe Internet resources or by selecting only suspicious ones. The method consists of the following steps: data preprocessing, user session selection, data analysis, and interpretation of the results. The source data are the log entries of the proxy server. At the first stage, the data useful for analysis are selected from the source data, after which the continuous data stream is divided into small portions (sessions) using kernel density estimation. At the second stage, soft clustering of the visited Internet resources is performed using topic modeling. The result of the second stage is a set of groups of Internet resources that have not yet been labeled. At the third stage, an expert interprets the results by analyzing the most popular Internet resources in each group. The method has many settings at each stage, so it can be configured for any format and specifics of the input data. Its scope is not limited to a single task: it can be used both as an additional preprocessing step to reduce the amount of input data and to detect anomalous data.
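
The session selection step can be illustrated with a minimal sketch. It assumes that each proxy log entry has already been reduced to a request timestamp in seconds for a single user, and that sessions are cut at local minima of a Gaussian kernel density estimate. The Gaussian kernel, the cutting rule, and the names `gaussian_kde`, `split_sessions`, `bandwidth`, and `grid_step` are illustrative assumptions, not details given in the paper.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a function that estimates the density of `points` at x
    using a Gaussian kernel (an assumed kernel choice)."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


def split_sessions(timestamps, bandwidth=60.0, grid_step=10.0):
    """Split request timestamps (seconds) into sessions.

    The density is evaluated on a regular grid; every local minimum
    of the density is treated as a boundary between two sessions.
    `bandwidth` and `grid_step` are hypothetical tuning parameters.
    """
    ts = sorted(timestamps)
    if len(ts) < 2:
        return [ts] if ts else []

    density = gaussian_kde(ts, bandwidth)
    n = int((ts[-1] - ts[0]) / grid_step) + 1
    grid = [ts[0] + i * grid_step for i in range(n)]
    values = [density(x) for x in grid]

    # A grid point is a cut if the density stops decreasing there.
    cuts = [
        grid[i]
        for i in range(1, n - 1)
        if values[i] < values[i - 1] and values[i] <= values[i + 1]
    ]

    sessions, current, cut_index = [], [], 0
    for t in ts:
        # Close the current session each time a cut point is passed.
        while cut_index < len(cuts) and t > cuts[cut_index]:
            if current:
                sessions.append(current)
                current = []
            cut_index += 1
        current.append(t)
    sessions.append(current)
    return sessions


# Three bursts of requests separated by long pauses.
example = [0, 20, 45, 70, 600, 620, 650, 1500, 1510]
sessions = split_sessions(example, bandwidth=60.0)
```

A larger `bandwidth` merges nearby bursts into one session, while a smaller one splits activity more finely, so in practice this value would have to be tuned to the log data at hand.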

cluster analysis, topic modeling, cybersecurity