Cybersecurity threat analysis and prediction of High-speed Railway Signal System based on knowledge graph

Published: April 05, 2021

Introduction

1. In recent years, the signal system of high-speed railways has faced unprecedented security threats. Because it lacks effective prediction and warning mechanisms for Advanced Persistent Threats (APTs), this project studies cybersecurity threat analysis and defense technology for the railway signal system

2. This project mainly supported my master’s thesis, titled "Analysis and Detection of Cyber Threat Behavior in Train Control System Based on Knowledge Graph"

The specific structure of my work is shown below:

Threat Analysis

Introduction

1. How can we verify that there are security issues in rail transit? We could conduct penetration tests in a simulation environment, but how do we work out the route and goal of a cyberattack within such a complex system? The answer is to first conduct a theoretical threat analysis

Design

1. I’ve proposed a novel methodology for the coalescence analysis of safety and security in cyber-physical systems, namely Process-Oriented and Coalescent Analysis (POCA). Unlike traditional object-oriented methods, which start the analysis directly from system components or communication links, our method focuses on the specific working process of the object, i.e. it is a process-oriented analysis

The overall framework of this methodology is shown below:

2. POCA consists of 2 parts, which are oriented toward functional safety and cyber security respectively. POCA achieves a coalescence of these 2 attributes by drawing the 2 parts together

The first part, system service process analysis, abstracts service processes into analyzable objects by referring to the STPA method, which lays the foundation for being “process-oriented”
The second part, system cyber threat analysis, identifies potential cyber threats from the outputs of the first part using common security analysis methods

3. I’ve written an academic paper about the work of this part for publication

A comparison of POCA with some previous methods is shown below:

Method	Safety	Security	System Service	Component Constraint	Threat Scenario	TTPs Analysis	Remediation
Attack Trees	-	√	-	√	√	√	-
OCTAVE	-	√	-	-	√	-	√
STPA	√	-	√	-	√	-	√
TARA	-	√	-	√	-	√	√
Extended TVRA	√	√	-	√	-	√	-
Threat Profile	-	√	-	-	√	√	-
STPA-Sec	-	√	√	-	√	-	√
STPA-SafeSec	√	√	√	√	√	-	√
POCA	√	√	√	√	√	√	-

Experiment

1. I’ve applied POCA to the Temporary Speed Restriction (TSR) scenario of the train control system, and indeed identified several threat scenarios against the TSR service

Since this paper has not been published yet, I can only provide part of the analysis output here

Threat Simulation

Introduction

1. This part not only verifies the usability of the POCA outputs, but also provides a dataset for subsequent research. Because the railway system is closed, only internal attacks are likely to be carried out. Common datasets that consist of external attacks, such as Web penetration, therefore do not meet our needs, and datasets that include attacks against railway systems are difficult to obtain. Thus, we have to generate our own dataset

Design

Simulation range

1. I’ve built a simulation range in my laboratory according to the Signal Safety Data Network (SSDN) of the train control system (networking processes):


Network Equipment
- The range uses switches to carry communication between the ground equipment, and uses a router to connect the ground equipment to the Centralized Traffic Control (CTC)
Data Generation Area (Environment data)
- This area is composed of 5 servers, each running several virtual machines that simulate the ground equipment of the train control system. Since the network cards of the virtual machines are configured in "bridge mode", these servers also act as "switches"
- The LAN of switch1 is configured as "domain"
Attacker
- The attacker operates a Kali Linux host that is already connected to the ISDN server within the LAN of switch 3
Data Collection and Analysis Terminal
- This terminal is a host running the Elasticsearch engine, which is responsible for collecting and processing data from the data generation area
- The knowledge graph constructed later is also deployed on it


Each virtual machine runs software that simulates the services of the train control system
Sysmon and NXlog are installed to generate and forward syslogs, respectively

Device	Operating System	IP	Security Tool	Main Software
RBC active	Windows 10	192.168.4.203	360 Security, Windows Defender	RBC Simulation Software, Sysmon+NXlog
RBC standby	Windows 10	192.168.3.105	Same as active	Same as active
ISDN Server active	Ubuntu 18.04	192.168.4.206	ClamAV	ISDN Simulation Software, SysmonForLinux+NXlog
ISDN Server standby	Ubuntu 18.04	192.168.3.106	Same as active	Same as active
TSRS active	Windows 10	192.168.4.200	360 Security, Windows Defender	TSRS Simulation Software, Sysmon+NXlog, EasyFileSharing
TSRS standby	Windows 10	192.168.3.103	Same as active	Same as active
TSR Interface Server	Windows Server 2008	172.110.2.11	360 Security, Windows Defender	Sysmon+NXlog
CTC active	Windows 10	172.110.2.12	360 Security, Windows Defender	CTC Simulation Software, Sysmon+NXlog
CTC standby	Windows 10	172.110.2.13	Same as active	CTC Simulation Software, Sysmon+NXlog, EasyFileSharing
Kali Linux	Kali Linux 2020	192.168.4.211	-	Metasploit, MITRE CALDERA
ELK mainframe	Windows 10	10.10.10.230	360 Security, Windows Defender	Elasticsearch

Simulation attack

1. I designed a complete attack strategy against the system based on the aforementioned POCA output “threat scenario2”. It covers all 12 tactics and includes 18 techniques of MITRE ATT&CK

2. I implemented the pre-penetration and post-penetration phases separately with Kali in the range, and all syslogs (3 days in total) were saved as the raw dataset


Development

Log labelling

1. The highly-textual information contained in logs will greatly increase the workload and difficulty of analysis, so it’s necessary to preprocess the dataset: try to add a “label” to each log to generalize its behavior

2. A Sysmon configuration file has already helped us take the first step. It maps logs to ATT&CK techniques in the RuleName field, which represents their security behavior reasonably well

3. However, this labelling is rudimentary: about 93% of logs end up labeled. In other words, the rules generalize so broadly that real attack behaviors are hard to distinguish

For example, all operations carried out through PowerShell are labeled as “T1059.001 Powershell”, although they can actually be divided more specifically

4. Therefore, we’ve written another set of detection rules (repository) specialized for command-line (powershell/cmd/terminal) inputs. It integrates 770 attack abilities of the MITRE CALDERA platform, covering 11 tactics and about 240 techniques (60%) of the ATT&CK matrix

Running test_in_my_case.py from the repository overwrites the RuleName of some logs with more precise “ATT&CK technique_ids”, and then exports the processed dataset from ELK as syslog.csv in one go
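
The refinement step described above can be sketched roughly as follows. This is a minimal illustration and not the code of test_in_my_case.py: the regex patterns, the `REFINEMENT_RULES` list, the `refine_rule_name` helper and the sample logs are all assumptions made for this example, and only the idea (overwrite a coarse RuleName when the command line matches a more specific rule) comes from the text.

```python
import re

# Hypothetical refinement rules: (regex over the command line, more specific ATT&CK technique ID).
# The patterns are illustrative and are NOT taken from the project's rule repository.
REFINEMENT_RULES = [
    (re.compile(r"\bCopy-Item\b.*\bstaged\b", re.IGNORECASE), "T1074.001"),  # Local Data Staging
    (re.compile(r"\bGet-ChildItem\b.*-Recurse", re.IGNORECASE), "T1083"),    # File and Directory Discovery
    (re.compile(r"\bRemove-Item\b", re.IGNORECASE), "T1070.004"),            # File Deletion
]


def refine_rule_name(log: dict) -> dict:
    """Return a copy of the log whose RuleName is overwritten by the first matching rule."""
    command_line = log.get("CommandLine", "")
    for pattern, technique_id in REFINEMENT_RULES:
        if pattern.search(command_line):
            return {**log, "RuleName": technique_id}
    return dict(log)  # no rule matched: keep the coarse Sysmon label


# Made-up sample logs that both carry the coarse PowerShell label.
logs = [
    {"RuleName": "T1059.001", "CommandLine": "powershell.exe Copy-Item TSR_Cancel.CONF C:\\staged"},
    {"RuleName": "T1059.001", "CommandLine": "powershell.exe Get-Date"},
]
refined = [refine_rule_name(log) for log in logs]
```

In this sketch the first log is relabeled as local data staging, while the second keeps its generic PowerShell label because no finer rule applies.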

Graph construction

Introduction

1. The expanding scale of cyberspace leads to a sharp increase in the amount of security-related data, which are diverse, fragmented, and heterogeneous. The main challenge in current security analysis is not data shortage, but how to effectively combine information from multiple sources

2. Previous chapters have generated various data, such as threat modeling results, system architecture, and system logs. To use these outputs effectively, we need to address the multi-source heterogeneity issue. The knowledge graph (KG), with its excellent data integration, correlation, and visualization capabilities, is the preferred technology

3. A knowledge graph is a large-scale semantic model composed of vertices and edges, and it can intuitively model various security scenarios. This section merges all the aforementioned outputs with existing achievements to construct a cybersecurity KG of the train control system

Design

Ontology structure

1. A review article “Recent Progress of Using Knowledge Graph for Cybersecurity” provides us with a general architecture of CSKG, which consists of 4 dimensions:

2. On the basis of this architecture, we’ve designed the following ontology structure of our CSKG of train control system. The next 4 sections will separately discuss each dimension

Meaning of edges between dimensions:

Dimension	Relationship	Description
CTI-Knowledge data	CTI-TTP	General security CTI is described by "technique" and "tactic" in knowledge data
Behavior data-Knowledge data	Syslog-TTP	"Technique" in knowledge data is used as the label to generalize the behavior of "syslog" in behavior data
Environment data-Behavior data	Asset-Process	"Process" in behavior data can be correlated with "asset" in environment data based on IP address
Environment data-CTI	Asset-CA	"Control action" in specific railway CTI is carried out by "asset" in environment data
Environment data-CTI	Asset-Weakness	"Weakness" in specific railway CTI exists in "asset" in environment data

Meaning of edges within dimensions:

Dimension	Relationship	Description
Knowledge data	Tactic-Technique	Tactic includes multiple techniques
	Technique-Capec	ATT&CK techniques and CAPEC attack patterns have overlaps
	Technique-Technique	Parent-child relationship exists among techniques
	Capec-Capec	Parent-child relationship exists among attack patterns
	Technique-Technique_mitigation	Mitigations can reduce the impact of techniques
	Technique-Technique_detection	Detection methods can identify traces of techniques
	Capec-Cwe	Attack patterns exploit weaknesses in components
	Cwe-Cve	Weaknesses in components include multiple vulnerabilities
CTI	CTI-CTI	Sequential relationship exists among attack behaviors
	Accident-Hazard	System hazards can cause accidents
	Hazard-Service	Abnormalities in service can cause system hazards
	Service-CA	Services include multiple control actions
	CA-Weakness	Control actions may lead to unsafe control actions
	Weakness-Weakness	Parent-child relationship exists among unsafe control actions
	TS-Weakness	Threat scenarios include multiple unsafe control actions
	TS-TS	Parent-child relationship exists among threat scenarios
Environment Data	Asset-Asset	Connection relationship exists among assets
Behavior Data	Syslog-Syslog	Chronological relationship exists among syslogs
	Process-Syslog	Processes include multiple logs
	Process-Process	Access relationship exists among processes
	Parentp-Childp	Parent-child relationship exists among processes

① Knowledge data

1. Knowledge data from common cybersecurity knowledge bases, such as ATT&CK, CAPEC, CWE, CVE, MITRE Engage and D3FEND, has already been linked together by researchers from MIT as an open-source graph, “BRON”, so I imported BRON directly as the knowledge data

The ontology structure of BRON’s main part is shown below:

② General security CTI

1. This dimension is represented as the “cyber threat intelligence (CTI)” ontology. At present, most CTI exists in unstructured or semi-structured forms. To construct the KG of this dimension, we need to extract that CTI into structured data

2. MITRE developed an open-source platform, “TRAM”, which can associate the input attack procedure (left) with ATT&CK techniques and tactics (right) to help generate CTI in a structured form as “TTPs”

3. Through the TRAM platform and manual verification, we’ve generated the general security CTI of some common attacks, which can be easily imported into the graph

The figure below shows an example of the extracted “file stealing” attack

② Specific railway CTI

1. This CTI refers to the threat modeling results of the target system. In this project, it is generated by POCA. The ontology structure of POCA outputs is shown below:

Results such as “control action”, “hazard” and “threat scenario” can be presented as ontologies
Results like “risk score” and “description” can be used as attributes of ontologies

2. The following is a conceptual display of entities and relationships in the specific railway CTI:

A small part of the content in the figure is different from the actual one

③ Environment data

1. This dimension is represented as the “asset” ontology. It is generally based on the topology of the target system and includes attributes such as the OS and IP of the equipment. It not only models the physical composition of the target system, but also acts as a bridge between behavior data and CTI

The environment data of this project has been provided here

④ Behavior data

1. The behavior data of this project is the syslog generated by Sysmon, and “behavior” is represented by the ATT&CK technique label

2. Details of this part are summarized in this repository. In short, logs whose EventID = 1 (ProcessCreate) or 10 (ProcessAccess) contain information that represents two kinds of process relations, “parentp-childp” and “process-process” respectively. We can use them, together with the inherent “time” relation, as the 3 major relations that form the syslog ontology structure:
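
As a rough illustration of how these three relations could be derived from parsed Sysmon events, consider the sketch below. It assumes each event is a dict carrying the usual Sysmon fields (`ProcessGuid` and `ParentProcessGuid` for event 1, `SourceProcessGUID` and `TargetProcessGUID` for event 10, plus `UtcTime`); the `key` field, the `build_behavior_edges` helper and the sample events are hypothetical and are not taken from the project's repository.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (relation, from, to)


def build_behavior_edges(events: List[Dict]) -> List[Edge]:
    """Derive the three behavior-data relations from a list of parsed Sysmon events."""
    edges: List[Edge] = []

    # "time" relation: link each log to the next one in chronological order.
    ordered = sorted(events, key=lambda e: e["UtcTime"])
    for earlier, later in zip(ordered, ordered[1:]):
        edges.append(("syslog-syslog", earlier["key"], later["key"]))

    for event in events:
        if event["EventID"] == 1:
            # ProcessCreate: the parent process spawns the child process.
            edges.append(("parentp-childp", event["ParentProcessGuid"], event["ProcessGuid"]))
        elif event["EventID"] == 10:
            # ProcessAccess: the source process accesses the target process.
            edges.append(("process-process", event["SourceProcessGUID"], event["TargetProcessGUID"]))
    return edges


# Made-up sample events; timestamps and GUIDs are placeholders.
events = [
    {"key": "2", "EventID": 10, "UtcTime": "2021-01-01 10:00:02.000",
     "SourceProcessGUID": "{B}", "TargetProcessGUID": "{C}"},
    {"key": "1", "EventID": 1, "UtcTime": "2021-01-01 10:00:01.000",
     "ParentProcessGuid": "{A}", "ProcessGuid": "{B}"},
]
edges = build_behavior_edges(events)
```

Sorting on the `UtcTime` string works here only because the assumed timestamp format orders lexicographically; a real pipeline would more likely parse the timestamps first.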

Development

1. A knowledge graph can be built on a graph database. This project uses ArangoDB, and the basic construction process is recorded here

2. Here, I take two dimensions, the specific railway CTI and the environment data, as examples to display the actually constructed graph:

3. Basic indicators of the constructed KG:


Indicator	Definition	Value	Explanation
Nodes	Number of nodes	$500966$	-
Edges	Number of edges	$1685976$	The graph is a directed graph
Isolated Nodes	Number of nodes with no edges	$22368$	As some nodes in BRON are isolated, this graph is a disconnected graph
Network Density	Ratio of actual edges to possible edges	$6.72\times10^{-6}$	The density is close to 0, indicating that the graph is sparse
Average Degree	The sum of degrees of all nodes divided by the number of nodes	$6.73$	On average, each node is connected to 6.73 edges
Maximum Depth	The longest path from the root node to a leaf node	$10$	The graph has the maximum depth when its root is the "accident" layer and its leaf is the "CVE" layer
Network Diameter	The longest shortest path between any two nodes	$14$	Such path is found between an "accident" node and a "CVE" node
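
The density and average degree in the table can be reproduced from the node and edge counts alone. The short check below assumes the standard directed-graph definitions (possible edges = N(N-1), and each edge adding one to the degree of both endpoints); the variable names are only for this example.

```python
nodes = 500_966
edges = 1_685_976

# Directed graph: every ordered pair of distinct nodes is a possible edge.
density = edges / (nodes * (nodes - 1))

# Each edge contributes to the degree of both of its endpoints.
average_degree = 2 * edges / nodes

print(f"density = {density:.2e}")
print(f"average degree = {average_degree:.2f}")
```

Both values agree with the table when rounded to the same precision.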

Anomaly Detection

Introduction

1. Modern cyberattacks are often concealed and highly variable, lacking obvious features or patterns, so traditional methods struggle to identify them effectively. Behavior-based anomaly detection has therefore become an important approach: it models system behaviors and then identifies potential threats by detecting abnormal ones

Design

Detection framework

1. From a macro perspective, the project studies 2 types of behavior: abstract threat behavior derived through theoretical analysis (CTI in KG), and specific system behavior collected through practical experiments (behavior data in KG). The overall threat detection idea is “based on CTI dimension, supplemented by other dimensions, detect anomalies in behavior data dimension”

2. Based on this idea, I’ve designed a behavior-based anomaly detection framework shown above, which defines 3 kinds of behaviors according to the threat level from low to high:

System device behavior
- It is the complete set of behavior data, including 2 subsets of mid & high-level behavior
- Due to the high proportion of labelled syslogs, it is hard to identify the abnormal data hidden in massive normal data at this level
Security threat behavior
- It is detected when some system device behaviors satisfy a specific attack pattern recorded in general security CTI
- The idea is: search for a combination of syslogs whose “ATT&CK technique labels” match the “techniques” used by the attack pattern
Service abnormal behavior
- It is detected when some security threat behaviors further conform to a specific threat scenario described in specific railway CTI
- The idea is: among detected attack patterns, search for those that involve operations on a “service command file” exploited by the threat scenario
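
The mid-level matching idea can be sketched outside the graph as an ordered subsequence match over technique labels. This is a simplification under stated assumptions: in the project the match is performed through graph edges (CTI-TTP and Syslog-TTP), whereas the `match_attack_pattern` helper, the three-step pattern and the sample logs below are hypothetical.

```python
from typing import Dict, List, Optional


def match_attack_pattern(syslogs: List[Dict], pattern: List[str]) -> Optional[List[str]]:
    """Return the keys of syslogs whose labels match the pattern steps in order, else None."""
    matched: List[str] = []
    step = 0
    for log in sorted(syslogs, key=lambda entry: entry["time"]):
        # Advance only when the next expected technique appears; other logs are skipped.
        if step < len(pattern) and log["technique"] == pattern[step]:
            matched.append(log["key"])
            step += 1
    return matched if step == len(pattern) else None


# Hypothetical pattern: discover files, stage them locally, delete traces.
file_stealing = ["T1083", "T1074.001", "T1070.004"]

# Made-up labelled syslogs; "time" is a simple ordering index.
syslogs = [
    {"key": "a", "time": 1, "technique": "T1083"},
    {"key": "b", "time": 2, "technique": "T1059.001"},
    {"key": "c", "time": 3, "technique": "T1074.001"},
    {"key": "d", "time": 4, "technique": "T1070.004"},
]
hit = match_attack_pattern(syslogs, file_stealing)
miss = match_attack_pattern(syslogs, ["T1083", "T1003.001"])
```

The greedy match tolerates unrelated logs between steps, which reflects the point that most logs at the lowest level are normal system behavior.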

Detection modes

1. As for the application, this framework can perform 2 detection modes:

Bottom-up means the detection is from the low-level all the way up to the high-level, and directly achieves the anomaly detection
However, considering the complexity of cyber attacks and the incompleteness of CTI in the KG, high-level abnormal behavior usually cannot be directly mapped through the bottom-up detection. Therefore, the more flexible bi-directional detection should be widely applied

2. The program flowchart of the following experiment, based on the detection framework, is:

Experiment

1. Graph traversal is the technical carrier of this detection experiment. Since our KG has a relatively large depth, the Breadth-First Search (BFS) is more applicable and efficient

ArangoDB’s query language AQL has integrated multiple basic algorithms including BFS, so we could develop detection functions based on it
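
To make the traversal idea concrete without a database, the sketch below enumerates paths in breadth-first order over a small in-memory graph. It only imitates the behavior of a depth-limited BFS traversal and is not ArangoDB's implementation; the `bfs_paths` helper and the toy adjacency list (already oriented in the traversal direction) are assumptions for this example.

```python
from collections import deque
from typing import Dict, List


def bfs_paths(graph: Dict[str, List[str]], start: str, max_depth: int) -> List[List[str]]:
    """Enumerate paths from start in breadth-first order, using at most max_depth edges."""
    paths: List[List[str]] = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)  # every non-trivial prefix is a result, shortest first
        if len(path) - 1 < max_depth:
            for neighbor in graph.get(path[-1], []):
                if neighbor not in path:  # simplification: no vertex repeats on a path
                    queue.append(path + [neighbor])
    return paths


# Toy graph with hypothetical vertex keys, walking from a CTI step down to an asset.
toy_graph = {
    "CTI/steal1": ["technique/T1083"],
    "technique/T1083": ["syslog/1"],
    "syslog/1": ["process/p1"],
    "process/p1": ["asset/8"],
}
all_paths = bfs_paths(toy_graph, "CTI/steal1", max_depth=4)
shallow_paths = bfs_paths(toy_graph, "CTI/steal1", max_depth=2)
```

Because shorter paths are emitted first, a depth limit directly bounds how far from the starting CTI node the search reaches.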

Security threat behavior detection (low → middle)

1. The general idea for detection at this level is:

Traverse within “CTI” to obtain all attack patterns’ entities (CTI)
Traverse to “asset” through “syslog” to find related logs for each pattern (behavior data)
Output the traversal path as the detection result

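// Start from the first step of the attack pattern and walk toward the assets.
// Note: no depth range (such as 1..8) is given here, so AQL traverses a depth of 1 by default.
FOR v, e, p IN ANY 'CTI/steal1'
                CTICTI,
                CTITTP,
                OUTBOUND TechniqueTechnique_mitigation,
                INBOUND SyslogTTP,
                INBOUND ProcessSyslog,
                INBOUND ParentpChildp,
                INBOUND AssetProcess
    OPTIONS {bfs: true}
RETURN p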

2. After executing queries similar to the one above, 2 kinds of attack patterns were successfully detected:

Lateral movement

File stealing

Service abnormal behavior detection (middle → high)

1. The basis for mapping from mid-level to high-level is the service command files. The general idea for detection at this level is:

Based on the traversal result of security threat behavior detection, set filter conditions for the specific command file to continue traversing upwards

2. The detected “lateral movement” does not involve any command file. For “file stealing”, the command-line input of “syslog/23647” (corresponding to the 3rd step of this attack pattern) indicates that Copy-Item was used to copy (steal) the “TSR_Cancel.CONF” command file to a folder called “staged”:

3. Based on this clue, the “file stealing” attack may be further mapped to the high-level service abnormal behavior

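// Continue upward from the 3rd step of "file stealing" to the threat scenarios.
// The path variable is named p so that the FILTER clauses can reference it.
FOR v, e, p IN 1..8 ANY 'CTI/steal3'
                     CTITTP,
                     INBOUND SyslogTTP,
                     INBOUND ProcessSyslog,
                     INBOUND ParentpChildp,
                     INBOUND AssetProcess,
                     INBOUND AssetCA,
                     INBOUND CAWeakness,
                     OUTBOUND TSWeakness
     OPTIONS {bfs: true}
     // Keep only paths through syslog 23647 that reach a "Leakage" weakness of TSR_Cancel
     FILTER p.vertices[*]._key ANY == "23647"
        AND p.vertices[*].command ANY == "TSR_Cancel"
        AND p.vertices[*].security_threat ANY == "Leakage"
RETURN p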

4. After executing the above code, only the first step of threat scenario2 (node “TS2”) was matched, suggesting that bottom-up detection is insufficient for our dataset. Therefore, bi-directional detection is required to further trace subsequent steps of threat scenario2

Service abnormal behavior detection (high → low)

1. The general idea for detection at this level is:

Traverse within “threat_scenario” to obtain remaining threat scenarios’ entities (CTI)
Traverse downward to “syslog” to find related logs for each scenario (behavior data)
Output the traversal path as the detection result

2. At first, node “TS2.1” was read, which involves tampering with the TSR cancel command file. However, since the attacker tampered with this file locally, no relevant logs can be detected

3. Then, continue traversing to the “TS2.1.1” node, which involves leakage of the TSR execution reminder command file. The corresponding AQL code is:

// Traverse downward from the threat scenario to the syslogs.
// The path variable is named p so that the FILTER clauses can reference it.
FOR v, e, p IN 1..7 ANY 'threat_scenario/TS2'
                     TSTS,
                     INBOUND TSWeakness,
                     OUTBOUND CAWeakness,
                     OUTBOUND AssetCA,
                     OUTBOUND AssetProcess,
                     OUTBOUND ParentpChildp,
                     OUTBOUND ProcessSyslog
     OPTIONS {bfs: true}
     // Restrict to CTC active (asset/8) or CTC standby (asset/9)
     FILTER p.vertices[*]._id ANY == "asset/8"
        OR p.vertices[*]._id ANY == "asset/9"
     FILTER p.vertices[*].command ANY == "TSR_ExecutionReminder"
        AND p.vertices[*].security_threat ANY == "Leakage"
     FILTER p.vertices[*].TargetFilename
        AND p.vertices[*].TargetFilename LIKE "%TSR_ExecutionReminder%"
RETURN p

4. Through the above code, abnormal behavior of operating such command file was detected:

Firstly, the RuleName field of “syslog/24049” indicates the involvement of a script and a payload, suggesting that it is very likely a trace of the attacker monitoring the TSR execution reminder
Furthermore, the TargetFilename field records the monitored file and its location as “C:\Users\Administrator\AppData\Roaming\Microsoft\Windows\Recent”, which is typically used to store shortcuts to recently used files
Therefore, it can be inferred that the script used by the attacker doesn’t monitor the original command file directly, but a shortcut in another directory

5. Continue traversing to the “TS2.1.1.1” node, which involves counterfeiting the TSR execution command file. The corresponding AQL code is:

// Same downward traversal, one level deeper, now looking for the counterfeit TSR execution file.
// The path variable is named p so that the FILTER clauses can reference it.
FOR v, e, p IN 1..8 ANY 'threat_scenario/TS2'
                     TSTS,
                     INBOUND TSWeakness,
                     OUTBOUND CAWeakness,
                     OUTBOUND AssetCA,
                     OUTBOUND AssetProcess,
                     OUTBOUND ParentpChildp,
                     OUTBOUND ProcessSyslog
     OPTIONS {bfs: true}
     // Restrict to CTC active (asset/8) or CTC standby (asset/9)
     FILTER p.vertices[*]._id ANY == "asset/8"
        OR p.vertices[*]._id ANY == "asset/9"
     FILTER p.vertices[*].command ANY == "TSR_Execution"
        AND p.vertices[*].security_threat ANY == "Counterfeit"
     FILTER p.vertices[*].TargetFilename
        AND p.vertices[*].TargetFilename LIKE "%TSR_Execution%"
RETURN p

6. Through the above code, abnormal behavior of operating such command file was detected:

Firstly, “syslog/4634” corresponds to event 11, which is generated when a new file is created or an existing file is overwritten. This is consistent with the fact that the attacker replaced the “TSR execution” command file with the stolen “TSR cancel” command file
Secondly, the process path recorded in the Image field includes cmd.exe, indicating that the attacker replaced the file through a remote command line
Then, the same abnormal behavior was detected on CTC active (asset/8) and standby (asset/9), indicating that the attacker had replaced files on both devices
Finally, the TargetFilename field clearly reveals that the attacker’s target is TSR_execution.CONF

Assessment

Detection result

1. We collected the detection results of our KG model and, as a reference, ran detection on the same dataset with a log analysis platform based on the ELK engine

2. As shown below, the log analysis platform detected 40% of all attack processes, while the KG identified 60% and was more effective at detecting post-penetration attacks

Stage	Process	Description	Detected by KG?	Detected by ELK?
Pre-penetration	1	Exploit vulnerabilities to establish connection with CTC from specific port	×	√
	2	Elevate permissions on CTC through MSF commands such as process migration	×	×
	3	Establish the proxy link between Caldera and CTC through the Meterpreter shell	√	√
	4	Use the Mimikatz to steal the name and password of CTC domain administrator	×	×
	5	Use stolen credentials to realize the lateral movement between the CTC active and standby	√	√
Post-penetration	1	Search in all agent hosts for reserved historic "TSR cancel" files	√	×
	2	Copy searched files to the Kali and delete the attack trace	√	√
	3	Tamper the content of the "TSR cancel" configuration file locally for forgery	×	×
	4	Remotely upload a PS script to CTC standby to monitor the change of its "execution reminder" file	√	×
	5	Once the targeted file changed, replace the "TSR execution" file with the counterfeit one	√	×
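
The percentages quoted above follow directly from the table. The small tally below copies the √/× marks as booleans; the `detection_rate` helper and the tuple layout are just a convenient encoding chosen for this example.

```python
# Rows copied from the table above: (stage, process, detected_by_kg, detected_by_elk)
results = [
    ("pre", 1, False, True),
    ("pre", 2, False, False),
    ("pre", 3, True, True),
    ("pre", 4, False, False),
    ("pre", 5, True, True),
    ("post", 1, True, False),
    ("post", 2, True, True),
    ("post", 3, False, False),
    ("post", 4, True, False),
    ("post", 5, True, False),
]


def detection_rate(rows, column, stage=None):
    """Fraction of processes detected in the given column, optionally for one stage."""
    selected = [row for row in rows if stage is None or row[0] == stage]
    return sum(row[column] for row in selected) / len(selected)


kg_total = detection_rate(results, 2)           # 6 of 10 processes
elk_total = detection_rate(results, 3)          # 4 of 10 processes
kg_post = detection_rate(results, 2, "post")    # 4 of 5 post-penetration processes
elk_post = detection_rate(results, 3, "post")   # 1 of 5 post-penetration processes
```

The per-stage split shows where the difference comes from: in the post-penetration stage the KG detects 4 of 5 processes and the ELK platform 1 of 5, while in the pre-penetration stage the ELK platform detects one more process than the KG.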

3. To further demonstrate the advantage of behavior-based detection, we take the detection of “Post-penetration process 5” as an example to analyze the difference between the 2 approaches:

Actually, the log analysis platform contains detection rules for the “malicious file replacement”, but primarily based on the “source IP”, “file name”, and “replaced content”. However, in our designed attack, the attacker’s IP was pre-set as legitimate, and the target file was only replaced by another service command file, without any malicious code. This case indicates that the platform’s feature-based detection can be easily bypassed
On the contrary, the detection of our KG model is behavior-based. Regardless of changes in features such as “IP”, “file name”, or “file content”, as long as the adversary still exhibits the behavior as “replacing command file”, it will be recognized as an anomaly in system service

Detection efficiency

1. To preliminarily evaluate the detection efficiency, we conducted repeated detection of “lateral movement” and “file stealing” attacks using the KG model on 3 groups of datasets (primarily varying in size), and performed similar operations using the aforementioned log analysis platform

2. As shown below, considering that the KG establishes associations (shortcuts) among logs, even with a larger amount of data (i.e., knowledge dimensions apart from logs), it still exhibits higher efficiency compared to the log analysis platform’s sequential query approach

Dataset (Number of logs)	Detection Target	Platform (Data Format)	Minimum Time (ms)	Maximum Time (ms)	Average Time (ms)
Dataset 1 (36970)	Lateral Movement	ELK (JSON)	36.87	68.81	50.44
	Lateral Movement	KG (Graph)	1.17	1.48	1.35
	File Stealing	ELK (JSON)	72.02	83.43	78.29
	File Stealing	KG (Graph)	3.98	5.53	4.68
Dataset 2 (50440)	Lateral Movement	ELK (JSON)	33.81	57.20	45.01
	Lateral Movement	KG (Graph)	1.95	2.18	2.06
	File Stealing	ELK (JSON)	75.56	91.64	83.40
	File Stealing	KG (Graph)	10.29	11.93	11.05
Dataset 3 (525776)	Lateral Movement	ELK (JSON)	76.31	99.83	87.28
	Lateral Movement	KG (Graph)	47.80	67.21	57.17
	File Stealing	ELK (JSON)	105.87	143.68	127.51
	File Stealing	KG (Graph)	96.47	124.71	110.29

3. However, the test data also show that the query efficiency of the graph database is affected more significantly by data size. This highlights the importance of deploying a distributed graph database to maintain performance when handling large data volumes
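
One way to see this effect is to compute the ratio of average ELK time to average KG time for each row of the table. The snippet below only re-uses the averages reported above; the dictionary layout and the `speedup` name are choices made for this example.

```python
# Average times in ms copied from the table above: (number of logs, target) -> (ELK, KG)
average_ms = {
    (36970, "lateral movement"): (50.44, 1.35),
    (36970, "file stealing"): (78.29, 4.68),
    (50440, "lateral movement"): (45.01, 2.06),
    (50440, "file stealing"): (83.40, 11.05),
    (525776, "lateral movement"): (87.28, 57.17),
    (525776, "file stealing"): (127.51, 110.29),
}

# Ratio > 1 means the KG query was faster on average.
speedup = {key: elk / kg for key, (elk, kg) in average_ms.items()}

for (size, target), ratio in speedup.items():
    print(f"{size:>7} logs, {target:<16}: KG is {ratio:.1f}x faster than ELK")
```

The KG stays ahead in every row, but its advantage shrinks from roughly 37x (lateral movement, Dataset 1) to about 1.2x (file stealing, Dataset 3) as the number of logs grows.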

Conclusion

1. This thesis focuses on threat modeling and anomaly detection research for train control system. The main contributions are:

A novel threat modeling approach is proposed, which integrates security analysis with the process of system service to achieve the coalescence of functional safety and cyber security of cyber-physical systems
A cybersecurity knowledge graph of railway train control system is constructed, which provides researchers with a global analysis perspective by using multidimensional data to model the behavior of railway systems
An abnormal behavior detection framework is proposed based on the constructed knowledge graph, which can effectively detect major attack behaviors hidden in system logs and provide intelligible visual outputs

2. Although certain results have been achieved, there are still limitations and researchable issues:

The POCA provides a relatively simple description of the attack patterns involved in threat scenarios, which directly leads to the inability to effectively associate 2 types of CTI when constructing the knowledge graph
Manual analysis is used to assist the graph model in the bi-directional detection. With the development of AI technology, the attack and defense scenarios will gradually become intelligent. Our graph model should also integrate a variety of model-based intelligent technologies to achieve fully automated analysis and detection