Text Mining


Analytical Approaches and Justification


Overview

  • Task: Gather data on the most common vulnerabilities in payment assets (e.g., by word count).
  • Value: Identifies which assets are most vulnerable to the most critical threats.
    • Companies can use this intelligence to flag URLs as risky.
    • While it is ultimately up to the user to determine if they click a link, being able to provide a potential warning is extremely valuable.
  • Potential Value: Identifying specific features of the URLs (e.g., number of symbols, digits, etc.) to see whether they improve the accuracy of predictions.
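
As an illustration of the URL feature idea above, here is a minimal Python sketch. The function name `url_features` and the choice of features (length, digit count, symbol count) are assumptions made for this example and are not taken from the original analysis.

```python
import string


def url_features(url):
    """Return simple character-level features for a URL (illustrative only)."""
    return {
        "length": len(url),
        # Count of numeric characters anywhere in the URL.
        "digits": sum(ch.isdigit() for ch in url),
        # Count of ASCII punctuation characters such as ':', '/', '?', '='.
        "symbols": sum(ch in string.punctuation for ch in url),
    }


# Hypothetical example URL.
features = url_features("http://a1.example.com/x?id=42")
```

Features like these could then be supplied as extra columns to whatever prediction model is in use.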

Process

  • Step 1: Gathered the NVD source data as a CSV file from the AWS data repository
  • Step 2: Copied the text from the description field into a text file for all 2022 records (approx. 3,500)
  • Step 3: Preprocessed the data (stop-word removal and stemming)
    • Removed irrelevant words (“the”, “a”, “of”, etc.)
    • Reduced words to their root form (e.g., “payment” and “payments” to “pay”; “banks” to “bank”)
  • Step 4: Used “databasic.io” to find the words most commonly used to describe the vulnerabilities and to create visualizations
  • Next Steps: Find the vulnerabilities most relevant to the identified threats and determine which systems, operating systems, and platforms they affect. This is how we will reach our stated goal.
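
The preprocessing and word-count steps above could be sketched in Python roughly as follows. This is not the workflow actually used (the counts were produced with databasic.io); the stop-word list is a small illustrative subset, and the suffix-stripping `stem` function is a naive stand-in for a real stemmer such as Porter's. All function names here are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for", "on"}

# Naive suffix stripping, standing in for a proper stemming algorithm.
SUFFIXES = ("ments", "ment", "ing", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(descriptions):
    """Count stemmed, non-stop-word tokens across all descriptions."""
    counts = Counter()
    for text in descriptions:
        for token in tokenize(text):
            if token not in STOP_WORDS:
                counts[stem(token)] += 1
    return counts


# Hypothetical sample descriptions, not taken from the NVD data.
sample = [
    "The payment gateway allows a denial of service.",
    "Payments to banks fail in the bank portal.",
]
counts = word_counts(sample)
top_words = counts.most_common(5)
```

With the real data, `descriptions` would be the description field of each 2022 record, and `most_common` would give the ranking used for the visualizations.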



Key Insights and Intelligence


Results

  • Emerging Threats: Top three threats identified from the data set:
    • Denial-of-Service Attacks
    • Cross-Site Scripting
    • Remote Code Execution
  • Types of infrastructure that threat actors are targeting to exploit the industry:
    • Compromised operating systems
    • Compromised applications
    • Compromised hardware (routers, switches, servers, etc.)
    • Compromised databases (Oracle, SQL, DB2, etc.)
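
One way to tally the threat categories listed above across the vulnerability descriptions is simple phrase matching. The sketch below is an assumption about how that could be done, not the method behind the results; the phrase lists in `THREAT_KEYWORDS` and the function names are hypothetical and would need tuning against the real descriptions.

```python
from collections import Counter

# Hypothetical phrase lists for each threat category.
THREAT_KEYWORDS = {
    "Denial of Service": ("denial of service",),
    "Cross-Site Scripting": ("cross-site scripting", "xss"),
    "Remote Code Execution": ("remote code execution", "execute arbitrary code"),
}


def tag_threats(description):
    """Return the threat categories whose phrases appear in the description."""
    text = description.lower()
    return [
        threat
        for threat, phrases in THREAT_KEYWORDS.items()
        if any(phrase in text for phrase in phrases)
    ]


def threat_counts(descriptions):
    """Count how many descriptions mention each threat category."""
    counts = Counter()
    for description in descriptions:
        counts.update(tag_threats(description))
    return counts


# Hypothetical sample descriptions, not taken from the NVD data.
sample = [
    "Cross-site scripting (XSS) in the login page.",
    "Allows remote attackers to cause a denial of service.",
    "A crafted packet leads to denial of service on the router.",
]
totals = threat_counts(sample)
```

The same pattern could be reused with a second dictionary of infrastructure terms (operating systems, applications, hardware, databases) to support the next steps.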