Filling Up the Security Data Lake

Dam the Lake!

The foundation of our data “dam” is a pool of information collected from multiple sources. Some data is ingested directly into the data lake storage account. Other data is ingested into the SIEM first and later forwarded to the data lake to meet long-term retention requirements. Typically, 70% of the data ingested into the SIEM is high in volume but low in value for threat detection.

A security data lake with a well-designed data model allows us to build a logging solution that is cost effective and is flexible enough to meet the need to log more and more sources. This increases visibility and provides data when it’s needed for investigation. Having all this data in a central location positions us for the day we can start leveraging artificial intelligence data models.

In this series of posts, we will be focusing on capabilities that are specific to Microsoft Sentinel as well as Azure data and analytics services. We will detail our solution in future posts, but for today we will concentrate on the architecture of data. How data is placed directly into the SIEM or directly into the data lake may differ for each organization, and this is a discussion of “the art of the possible”. Don’t forget to check out our previous posts where we explain why your SIEM must be intimately tied to your data lake.

Let’s begin building our data design, which we will keep simple by using three data classifications: hot, warm, and cold. These terms may become blurred as we incorporate modern data compute and storage services such as Azure Data Explorer (ADX) and Azure Data Lake Storage Gen2 (ADLS). Below is a description of each type. Note that ALL data is searchable in our solution via the Microsoft Sentinel console using the “Search” blade. How well the data can be investigated or used across cloud services depends on search speed and query capabilities.

| Data Classifications (Retention) | Why | Location | Comments |
| --- | --- | --- | --- |
| Hot (3 months) | Incident Creation, Correlation, Near Realtime Threat Detection | SIEM (Log Analytics) | High Value Log Sources, All Security Alerts from all Security Tools |
| Warm (4 months – 1 year) | Long-term SOC Investigations | SIEM (Log Analytics) | Most SIEM data |
| Cool/Cold (0 to 10 years) | Compliance and Data Analytics (ML/AI) | Data Lake (Azure Storage Data Lake) | See questions below for suggested direct ingest bypassing SIEM first |

Data Classifications
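To make the classifications concrete, here is a minimal Python sketch of how a record's age might map to a tier. It is an illustration only, not part of our solution: the function name `tier_for`, the 30-day month, the 365-day year, and the decision to treat the gap between month 3 and month 4 as warm are all assumptions made for the example.

```python
from typing import Optional

# Retention boundaries taken from the table above.
# Assumption: 1 month = 30 days and 1 year = 365 days.
HOT_MAX_DAYS = 3 * 30        # hot: first 3 months in the SIEM
WARM_MAX_DAYS = 365          # warm: up to 1 year in the SIEM
COLD_MAX_DAYS = 10 * 365     # cool/cold: up to 10 years in the data lake


def tier_for(age_days: int, bypasses_siem: bool) -> Optional[str]:
    """Return the tier a record of the given age falls into.

    bypasses_siem is True for sources ingested directly into the data lake,
    which are cold from day 0. Returns None once retention has expired.
    """
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days > COLD_MAX_DAYS:
        return None  # past the longest retention period in the table
    if bypasses_siem:
        return "cold"
    if age_days <= HOT_MAX_DAYS:
        return "hot"
    if age_days <= WARM_MAX_DAYS:
        # Assumption: the table lists warm as starting at 4 months;
        # here everything after the hot window counts as warm.
        return "warm"
    return "cold"
```

For example, `tier_for(10, bypasses_siem=False)` yields `"hot"`, while the same record from a source that bypasses the SIEM is `"cold"` from the start.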

Now the question becomes: which data sources are ideal candidates for bypassing the SIEM and going directly into the data lake? This is a discussion of “value”, both in terms of cost and in terms of retaining the security capabilities needed for detection and investigation. Here are some questions to help determine where the data should be ingested:

1. What is the security value of the logs? Are they high volume but low in detection value?

2. Does the cost of the logs significantly impact the security budget?

3. Is the data typically used for compromise detection or for incident forensics?

4. Does the SIEM have analytics rules out of the box that can immediately provide high-value threat detections?

5. Do you have a compliance need for the data?
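The five questions above can be sketched as a simple routing heuristic. The Python below is a hypothetical illustration, not our solution: the `LogSource` fields, the `route` function, and the order in which the answers are weighed are assumptions, and each organization would weigh them differently.

```python
from dataclasses import dataclass


@dataclass
class LogSource:
    """Answers to the five questions for one log source (hypothetical model)."""
    name: str
    high_detection_value: bool   # Q1: security value of the logs
    high_volume: bool            # Q1: high volume?
    significant_cost: bool       # Q2: notable impact on the security budget
    used_for_detection: bool     # Q3: True = compromise detection, False = forensics
    has_builtin_analytics: bool  # Q4: out-of-the-box analytics rules exist
    compliance_required: bool    # Q5: compliance need for the data


def route(source: LogSource) -> str:
    """Suggest an ingestion target: 'siem', 'data_lake', or 'review'."""
    # Assumption: anything that feeds detection stays in the SIEM.
    if (source.high_detection_value
            or source.used_for_detection
            or source.has_builtin_analytics):
        return "siem"
    # Low detection value plus volume, cost, or compliance pressure
    # makes the source a candidate for direct data lake ingestion.
    if source.high_volume or source.significant_cost or source.compliance_required:
        return "data_lake"
    # Nothing argues clearly either way; decide case by case.
    return "review"


# Hypothetical sources, for illustration only.
flow_logs = LogSource("network_flow_logs", False, True, True, False, False, True)
alerts = LogSource("security_alerts", True, False, False, True, True, False)
```

With these example inputs, `route(flow_logs)` suggests `"data_lake"` and `route(alerts)` suggests `"siem"`. The value of writing the rules down this way is that the reasoning behind each placement becomes explicit and easy to revisit.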

Building a foundation for the present and future

Once again, not all organizations will view their data in the same way. To help you identify data sources and decide where they should be ingested, please see our GitHub project repository here, where there is a master list in Microsoft Excel format. We believe that by examining your data design in light of your security goals, you will gain a better understanding of the value of your SIEM.

Ask yourself how effective your SIEM is in meeting your needs. Is it just a log aggregation tool, or is it simply a check-the-box compliance tool? Or is it truly a tool that your analysts rely on to do their daily jobs? We submit that a proper data design, coupled with a flexible data analytics platform, is a must-have for the SIEM of the present and future.

Now is the time to re-examine your SIEM architecture and look at ways to build the foundations of your data lake “dam”. Without a data lake, you will not be able to leverage the cost efficiencies of the cloud, nor its ability to provide advanced data analytics such as machine learning, anomaly detection, and artificial intelligence. Please stay with us as we unveil a solution that you can begin adopting now. We believe that we are offering one of the most impactful capabilities for securing your organization. Those who do not adopt a big data strategy will be unable to put their own historical data to work for defense and detection. Why wait to fill up your lake?

Don’t miss our previous blogs!

What to bring to the Data Lake?

Will your SIEM survive?
