Can a new method of analysing big datasets help organisations fight off cyber attacks?
Governments and businesses face growing cybersecurity threats as they increasingly rely on network-connected platforms and the internet.
Toll, Microsoft, the Australian Department of Defence, Facebook, Zoom and Bluescope Steel are just some of the organisations that have been targeted.
Not only must organisations have secure cyber systems, but they also need to have safeguards in place to detect these attacks quickly.
Analysing the automatically generated logs from network and system devices provides detailed insight into how secure these systems actually are. This data can also be used to predict attacks.
Predicting and identifying attacks
But the real challenge is the sheer volume of data: at that scale, predicting or even identifying attacks is difficult.
In 2014, a team led by Professor Rob Hyndman, Head of the Department of Econometrics and Business Statistics at Monash Business School, was approached by Yahoo.
Yahoo was looking for ways to prevent and identify cybersecurity attacks on its mail servers.
“They needed to know where their engineers should focus attention to check if there was an attack or a faulty server,” Professor Hyndman says.
Professor Hyndman developed a method combining time-series features with Principal Component Analysis.
The approach allowed him to distil a very large and complicated data set into just two measures for each server, which together captured much of the variation in the data.
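To make the idea concrete, here is a minimal Python sketch of that kind of reduction: compute a few simple time-series features for each server's traffic, then use Principal Component Analysis to collapse them into two measures per server. The feature choices, simulated data and library calls are illustrative assumptions, not the actual Yahoo pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

def ts_features(series):
    """A handful of simple time-series features for one server's
    log-derived series (e.g. hourly request counts). Illustrative only."""
    diffs = np.diff(series)
    return np.array([
        np.mean(series),                              # average level
        np.std(series),                               # overall variability
        np.max(series) - np.min(series),              # range (spikiness)
        np.mean(np.abs(diffs)),                       # average change between periods
        np.corrcoef(series[:-1], series[1:])[0, 1],   # lag-1 autocorrelation
    ])

# One row of features per server, computed from simulated traffic series.
rng = np.random.default_rng(0)
servers = [rng.poisson(lam=100, size=500).astype(float) for _ in range(200)]
X = np.vstack([ts_features(s) for s in servers])

# Standardise the features, then project onto the first two principal
# components: just two measures per server, capturing much of the variation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pcs = PCA(n_components=2).fit_transform(X_std)
print(pcs.shape)   # (200, 2) -- two measures for each server
```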
Identifying anomalies
He then used a method he developed in the mid-1990s to find the most unusual observations using only these two measures.
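The details of that method are beyond this article, but the general idea can be sketched simply: estimate how densely populated each region of the two-measure space is, and treat the servers whose points sit in the sparsest regions as the most unusual. The kernel density estimate and simulated scores below are illustrative assumptions, not the original method.

```python
import numpy as np
from scipy.stats import gaussian_kde

# `pcs` stands in for the two reduced measures per server, simulated here.
rng = np.random.default_rng(0)
pcs = rng.normal(size=(200, 2))
pcs[:3] += 6.0                        # three servers behaving very differently

kde = gaussian_kde(pcs.T)             # estimate the density of each region
density = kde(pcs.T)
most_unusual = np.argsort(density)[:3]   # lowest-density points = most unusual
print(most_unusual)                   # -> the three injected outliers
```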
This statistical method of using time series features with Principal Component Analysis to identify anomalies has also been used more broadly, including the identification of problems with water quality.
Working in conjunction with the Queensland Government’s Department of Environment and Science and researchers from QUT and RMIT, Professor Hyndman is currently analysing data from sensors placed along the river networks that run to the Great Barrier Reef.
The sensors record data on a range of measures such as nitrate levels within the water and enable scientists to monitor the health of the river network.
“Because of the Government’s intention to roll out a large number of these sensors, it is not practical for someone to drive out and check these sensors for any anomalies,” he says.
“Instead we are using statistical methods to identify any measurements that show up as unusual. This may be due to pollution in the river or malfunction in the sensor itself.
“We know that the reading of one sensor should be similar to the one upstream. So, we are using a prediction-based anomaly technique to identify any unusual readings. Automating these readings is resulting in quicker investigations and is much more cost-effective.”
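As a rough illustration of a prediction-based approach (the linear model, threshold and simulated nitrate values below are assumptions, not the project's actual models), each downstream reading can be predicted from its upstream neighbour and flagged when the prediction error is unusually large.

```python
import numpy as np

def flag_unusual_readings(upstream, downstream, threshold=4.0):
    """Predict each downstream reading from the paired upstream reading
    with a simple linear fit, then flag readings whose residual is
    unusually large. Illustrative sketch only."""
    slope, intercept = np.polyfit(upstream, downstream, deg=1)
    residuals = downstream - (slope * upstream + intercept)
    # Robust scale estimate so anomalies don't inflate the threshold.
    mad = np.median(np.abs(residuals - np.median(residuals)))
    scores = np.abs(residuals) / (1.4826 * mad)
    return np.where(scores > threshold)[0]      # indices of suspect readings

# Example: downstream roughly tracks upstream, with one polluted or faulty reading.
rng = np.random.default_rng(1)
up = rng.normal(0.5, 0.1, 300)                  # upstream nitrate (mg/L)
down = 0.9 * up + rng.normal(0, 0.01, 300)      # downstream nitrate (mg/L)
down[120] += 0.3                                # injected anomaly
print(flag_unusual_readings(up, down))          # -> [120]
```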
Network detection
In order to prevent cyber attacks, Professor Hyndman says many organisations use Network Intrusion Detection Systems (NIDS).
Generally, there are two main categories of these systems: signature-based NIDS and anomaly-based NIDS.
Signature-based NIDS identifies misuse by scanning activity signatures against a database of known malicious signatures.
“Signature-based NIDS identifies attacks that are similar to previous attacks,” says Professor Hyndman. “So, if an attacker is smart, they just have to make a slight change and they can avoid being detected.”
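A toy illustration of that weakness, using made-up signatures and events rather than any real NIDS rule set: an event is flagged only if it matches a known pattern, so a slightly altered attack slips through.

```python
# Hypothetical signature database and events, for illustration only.
KNOWN_SIGNATURES = {
    "GET /admin.php?cmd=;rm -rf /",
    "USER root' OR '1'='1",
}

def signature_match(event: str) -> bool:
    # Flag the event only if it exactly matches a known malicious signature.
    return event in KNOWN_SIGNATURES

print(signature_match("USER root' OR '1'='1"))   # True  -- known attack
print(signature_match("USER root' OR '2'='2"))   # False -- slight variant evades detection
```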
In an anomaly-based NIDS, a profile of normal activities is developed using statistical models and future activities are then compared against this profile.
Anything unusual is identified as a potential anomaly, so a new type of attack can still be detected. This is known as an ‘unsupervised’ algorithm because it works without human classification of what is an attack and what is not.
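As a rough sketch of the idea (the traffic features, simulated data and distance measure are assumptions for illustration, not a particular NIDS), a profile of normal activity can be summarised statistically and new activity scored by how far it sits from that profile.

```python
import numpy as np

# Simulated 'normal' traffic features: bytes per connection, packets per second.
rng = np.random.default_rng(2)
normal = rng.multivariate_normal([500, 20], [[900, 10], [10, 4]], size=5000)

# The profile: mean and covariance of normal activity.
mean = normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))

def anomaly_score(activity):
    # Mahalanobis distance of new activity from the normal profile.
    d = activity - mean
    return float(np.sqrt(d @ cov_inv @ d))

print(anomaly_score(np.array([510, 21])))     # small score: consistent with the profile
print(anomaly_score(np.array([5000, 300])))   # large score: flagged as a potential attack
```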
Yet, Professor Hyndman says there still needs to be human involvement to review and classify potential anomalies as normal or malicious to ensure an appropriate model of ‘normal activity’ can be developed. This does not scale well for very large data sets.
Training dataset problems
Unsupervised learning methods do not require a clean dataset that is free of attacks. They work in two phases. In the training phase, a model learns the normal behaviour of a system from a dataset of normal activity.
In the monitoring phase, the model evaluates the behaviour of new data, and any flagged ‘anomaly’ that is subsequently classified as normal is folded back into the profile.
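A compact sketch of that two-phase loop, with a deliberately simple one-variable profile (the threshold and simulated data are illustrative assumptions):

```python
import numpy as np

class SimpleProfile:
    def __init__(self, training_data, z_threshold=4.0):
        self.data = list(training_data)       # training phase: learn normal behaviour
        self.z_threshold = z_threshold

    def is_anomaly(self, x):
        # Monitoring phase: compare a new reading against the current profile.
        mu, sigma = np.mean(self.data), np.std(self.data)
        return abs(x - mu) / sigma > self.z_threshold

    def confirm_normal(self, x):
        # A flagged value reviewed and classified as normal joins the profile.
        self.data.append(x)

profile = SimpleProfile(np.random.default_rng(3).normal(100, 5, 1000))
x = 130.0
if profile.is_anomaly(x):
    print("flagged for review")
    profile.confirm_normal(x)   # reviewer decides it is benign, so the profile is updated
```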
“The advantage of this method is that it has the ability to identify never-seen-before attacks,” Professor Hyndman says.
“However, these algorithms can require a lot of computing power and memory.”
Principal Component Analysis can be used for these algorithms because it is fast and scalable. However, it can still miss important features in the data.
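A small simulated example of that limitation: an observation that is anomalous only in a low-variance direction can look entirely ordinary once the data are reduced to the first two principal components. The data below are made up purely to illustrate the point.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(0, [10.0, 5.0, 0.1], size=(1000, 3))   # third feature has tiny variance
X[0, 2] = 2.0                                          # anomaly hidden in the quiet feature

# PCA keeps the directions of largest variance, so the third feature is
# effectively discarded and the anomaly disappears from the reduced space.
pcs = PCA(n_components=2).fit_transform(X)
print(pcs[0])   # unremarkable in the two retained components
print(X[0])     # the anomaly is only visible in the dropped direction
```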
In his most recent work, Professor Hyndman is collaborating with Dr Sevvandi Kandanaarachchi (RMIT) to find better ways of reducing large amounts of data to a small number of measures that highlight the unusual observations.
“Ideally, we want a fast method that scales to huge data sets and allows us to automatically spot cyber attacks, or find anomalies in river network sensors, in real-time,” Professor Hyndman says.