What to bring to the Data Lake?

“Yes – it’s more than a bathing suit”

Security engineering teams need to develop new skills to give their security analysts the depth of data and analytics they need to do their jobs effectively. Analysts need this data to be readily available in the SIEM during an incident. We must reduce the time it takes to triage an event to pave the way for automation to remediate it. Automated incident response (auto IR) is the present and the future. As we previously discussed, your SIEM must be tightly integrated with your data lake. To achieve this integration, you will need to acquire a few new capabilities and potentially partner with data analytics teams. These new skills will give the business increased capability and help better secure the organization. What are these capabilities?

First, we need a flexible and efficient way to load data directly into the data lake. This requires Extract, Transform, and Load (ETL) tools such as Apache NiFi, Logstash, Cribl, Azure Monitor, and Azure Data Factory. These tools let you forward almost any data source, transform it, and load it into Azure Data Lake Storage in a highly performant, optimized format that is ready for big data analytics and efficient storage. Which ETL tool you choose will depend on your team’s ability to manage that particular solution. This opens up visibility into data sources that were previously too expensive or too difficult to put directly into the SIEM, and it gives operations, infrastructure, compliance, and security teams greater access to data.
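To make the "transform" step concrete, here is a minimal sketch in plain Python of what an ETL tool does before loading a record into the lake. It is only an illustration, not how any of the tools above are configured. The `key=value` log format, the field names (`ts`, `src`, `dst`, `app`, `bytes`), and the date-partitioned folder layout are all assumptions made for this example.

```python
from datetime import datetime, timezone


def parse_firewall_line(line: str) -> dict:
    """Turn a raw 'key=value key=value' log line into a typed record.

    The input format and field names are hypothetical.
    """
    fields = dict(pair.split("=", 1) for pair in line.split())
    return {
        "timestamp": datetime.fromtimestamp(int(fields["ts"]), tz=timezone.utc),
        "src_ip": fields["src"],
        "dst_ip": fields["dst"],
        "app": fields.get("app", "unknown"),
        "bytes": int(fields.get("bytes", 0)),
    }


def partition_path(record: dict, dataset: str = "firewall") -> str:
    """Build a date-partitioned folder path for the record (assumed layout)."""
    ts = record["timestamp"]
    return f"{dataset}/year={ts:%Y}/month={ts:%m}/day={ts:%d}"


raw = "ts=1700000000 src=10.0.0.5 dst=52.1.2.3 app=https bytes=4096"
record = parse_firewall_line(raw)
print(record)
print(partition_path(record))
```

In a real pipeline the typed records would then be written out in a columnar format such as Parquet by the ETL tool itself; that step is left out here to keep the sketch free of external libraries.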

After obtaining the data, the next step is to build what we call the “Integration Layer.” This capability makes the data lake's data format (schema) known to the analytics and presentation layer tools, such as Power BI, Sentinel, and Log Analytics. Once the integration layer is set up, we can perform analytics, querying, and reporting on the data. Essentially, we have decoupled data storage from data analytics and compute. This alone lets us optimize compute costs and storage costs independently.

Excellent! With the integration layer and the data lake schema readily available, we can now extract value from the data. This is where things become exciting and interesting, and where we provide the most value to the business and security operations. Let’s take firewall or network traffic data as an example of what we can do next with our analytics capabilities. We will set up a logic app that accesses the data lake through our integration layer and summarizes all network connections by source, destination, and application. This summary will then be written into the Sentinel workspace, where your analytic rules can use it for threat detection.
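The summarization itself is simple to picture. The sketch below shows only the aggregation logic in plain Python; in the approach described above it would run as a logic app that queries the lake through the integration layer. The record fields (`src_ip`, `dst_ip`, `app`, `bytes`) and the sample data are hypothetical.

```python
from collections import defaultdict


def summarize_connections(records):
    """Group connection records by (source, destination, application).

    Returns one row per group with a connection count and total bytes.
    Field names are assumptions made for this example.
    """
    summary = defaultdict(lambda: {"connections": 0, "bytes": 0})
    for r in records:
        key = (r["src_ip"], r["dst_ip"], r["app"])
        summary[key]["connections"] += 1
        summary[key]["bytes"] += r["bytes"]
    return [
        {"src_ip": src, "dst_ip": dst, "app": app, **totals}
        for (src, dst, app), totals in sorted(summary.items())
    ]


records = [
    {"src_ip": "10.0.0.5", "dst_ip": "52.1.2.3", "app": "https", "bytes": 4096},
    {"src_ip": "10.0.0.5", "dst_ip": "52.1.2.3", "app": "https", "bytes": 1024},
    {"src_ip": "10.0.0.9", "dst_ip": "8.8.8.8", "app": "dns", "bytes": 128},
]

rows = summarize_connections(records)
for row in rows:
    print(row)
```

Each summary row is far smaller than the raw logs it represents, which is why writing only the summary into the Sentinel workspace keeps detection data available without ingesting every raw event.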

This is just one simple example of how we can extract the full value of logs for security operations. There are many other use cases we can implement to drive value back into the SIEM/Sentinel. Note also that the log data in the lake remains searchable by the security analyst at any time for hunting and incident investigations and, if necessary, can be fully restored into Sentinel as hot, queryable data.

To summarize, in this post we have outlined the need to add new capabilities or, as we say, to learn to exercise new muscles that will pay significant dividends for your organization. Challenge yourself to learn about ETL and to understand why data formats like Parquet and Delta Parquet are crucial to your success in going to the lake. In our next post, we will guide you through our architecture, which leverages Azure tools such as Sentinel, Log Analytics, Azure Data Explorer, and Azure Data Lake Storage Gen2.

Check out our earlier posts in this series.