Text Mining
Analytical Approaches and Justification
Overview
- Task: Gather data on the most common vulnerabilities affecting payment assets (e.g., by word count).
- Value: Identify the most vulnerable assets by linking them to the most critical threats.
- Companies can use this intelligence to flag URLs as risky.
- While it is ultimately up to the user whether to click a link, being able to provide a warning is extremely valuable.
- Potential Value: Identifying specific features of the URLs (e.g., number of symbols, digits, etc.) to see whether they improve prediction accuracy.
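The URL features mentioned in the last bullet could be computed along the following lines. This is only a minimal sketch: the function name `url_features`, the choice of features, and the example URL are illustrative assumptions, not part of the project's actual workflow.

```python
import string


def url_features(url: str) -> dict:
    """Return simple character-count features for a URL string.

    Hypothetical helper: "symbols" here means any ASCII punctuation character.
    """
    return {
        "length": len(url),
        "digits": sum(ch.isdigit() for ch in url),
        "symbols": sum(ch in string.punctuation for ch in url),
    }


# Example with a made-up URL
features = url_features("http://example.com/pay?id=123")
print(features)
```

Counts like these could then be passed to a classifier alongside other signals to check whether they help flag risky URLs.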
Process
- Step 1 : Gathered the NVD source data as a CSV file from the AWS data repository
- Step 2 : Copied the text of the description field into a text file for all 2022 records (approx. 3,500 records)
- Step 3 : Preprocessed the data using stop-word removal and stemming:
- Removed irrelevant words (“the”, “a”, “of”, etc.)
- Reduced words to their root form (e.g., “payment” and “payments” to “pay”; “banks” to “bank”)
- Step 4 : Used “databasic.io” to find the words most commonly used to describe the vulnerabilities and to create visualizations
- Next Steps : Find vulnerabilities most relevant to the identified threats and determine what systems, OS, and platforms they specifically refer to. This is how we will reach our stated goal.
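Steps 2 through 4 could also be approximated in code. The sketch below is not how the analysis was run (that used databasic.io). It is a stand-in under stated assumptions: the CSV column names `published` and `description` are hypothetical, the stop-word list is deliberately tiny, and the suffix-stripping `stem` function is a crude illustration rather than a real stemming algorithm such as Porter's.

```python
import csv
import re
from collections import Counter

# Assumption: a very small stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "via"}

# Assumption: crude suffix stripping, not a real stemming algorithm.
SUFFIXES = ("ments", "ment", "s")


def load_descriptions(path, year="2022"):
    """Read description text for one year from a CSV export.

    The column names "published" and "description" are hypothetical.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return [
            row["description"]
            for row in csv.DictReader(handle)
            if row["published"].startswith(year)
        ]


def stem(word):
    """Strip one known suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_terms(descriptions, n=10):
    """Count stemmed, non-stop-word tokens across all descriptions."""
    counts = Counter()
    for text in descriptions:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(stem(t) for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Example with two made-up descriptions
sample = [
    "A flaw in the payment gateway allows denial of service.",
    "Payments to banks can be intercepted via the bank portal.",
]
print(top_terms(sample, 2))
```

In this toy sample, “payment” and “payments” both collapse to “pay”, and “banks” to “bank”, which mirrors the stemming examples in Step 3.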
- Emerging Threats: Top three threats identified from the data set:
- Denial of Service Attacks
- Cross-Site Scripting
- Remote Code Execution
- Types of infrastructure that threat actors are targeting to exploit the industry:
- Compromised Operating systems
- Compromised Applications
- Compromised Hardware (routers, switches, servers, etc.)
- Compromised Databases (Oracle, SQL, DB2, etc.)
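As one possible way to cross-check the three threats listed above against the description text, each record could be tagged by phrase matching. This is a sketch only: the phrase lists in `THREAT_PHRASES` are assumptions chosen for illustration, and the sample descriptions are made up.

```python
from collections import Counter

# Assumption: illustrative phrase lists; real descriptions may use other wording.
THREAT_PHRASES = {
    "Denial of Service": ("denial of service",),
    "Cross-Site Scripting": ("cross-site scripting", "xss"),
    "Remote Code Execution": ("remote code execution", "execute arbitrary code"),
}


def tally_threats(descriptions):
    """Count how many descriptions mention each threat at least once."""
    counts = Counter()
    for text in descriptions:
        lowered = text.lower()
        for threat, phrases in THREAT_PHRASES.items():
            if any(phrase in lowered for phrase in phrases):
                counts[threat] += 1
    return counts


# Example with made-up descriptions
threat_sample = [
    "Cross-site scripting (XSS) in the checkout page.",
    "A crafted packet causes a denial of service on the router.",
    "Attackers can execute arbitrary code, leading to denial of service.",
]
threat_counts = tally_threats(threat_sample)
print(threat_counts)
```

The same pattern could be extended with keyword lists for the infrastructure categories (operating systems, applications, hardware, databases) to support the next steps.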