Data Transformers to the Rescue

ETL vs Log Forwarding – Why your security future depends on it!

We are now officially in a new era of security engineering, one characterized by big data analytics encompassing AI, machine learning, and data warehousing. In our previous posts, we discussed the need for security operations to have greater visibility into log sources. We need to provide better threat analytics, as well as data that is readily available for forensic needs. This new era requires skills in data onboarding, which is a different capability from the older practice of forwarding logs from one system to another. Today, we are introducing a different way to onboard data into security tools and data lakes. This method is called “Extract, Transform, Load”, or ETL for short.

Bottom line up front (BLUF): Using ETL can reduce your SIEM ingestion by 70% while maintaining the security value of your logs and letting you retain more of them. This saves ingestion and storage costs and, best of all, prepares your data for large-scale log analytics and our AI security future. So, let’s ETL together.

There are two ETL options we recommend. The first is Apache NiFi, an open-source project of the Apache Software Foundation that has been around for over eight years. It has excellent community support and is frequently updated.

Wikipedia defines Apache NiFi as an open-source data integration tool that is used to automate the flow of data between systems. It provides a web-based interface for designing, building, and managing data flows, making it easy to move data between different systems and applications.

Per nifi.apache.org, NiFi is designed to handle data in real time and at scale, making it well-suited for use in big data environments. It supports a wide range of data sources and destinations, including databases, messaging systems, file systems, and web services. NiFi provides a variety of processors that can be used to transform, enrich, and route data as it flows through the system. It also supports data provenance, which enables users to track the origin and history of data as it moves through the system.

The second tool we recommend is Cribl. Cribl is a full-featured ETL tool that comes with support and requires a license at certain ingestion levels. Cribl may be a better solution for clients who need technical support and more guidance as they work with ETL.

Now, let’s explore how and why we use ETL tools to provide flexibility in our security log source aggregation. Let’s use firewall traffic logs as our example. ETL can make aggregation faster and more reliable. It can also prepare and load the data into our data lake for search and big data analytics.

The “Extract” process includes the following steps. 

Step 1 involves forwarding native logs to the ETL tool, which in this case listens for Syslog/CEF messages on TCP/UDP port 514. Step 2 is a tagging, enrichment, and routing process, where we identify the log source, tag it, and route it to the correct data transformation pipeline.
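To make Step 2 more concrete, here is a minimal Python sketch of tagging and routing a single CEF message. It is not how NiFi or Cribl implement this internally; the routing table, the pipeline names, and the vendor and product strings are all hypothetical, and the extension parser is deliberately simplified.

```python
import re

# The seven pipe-delimited CEF header fields, in order.
CEF_HEADER_FIELDS = [
    "cef_version", "vendor", "product", "device_version",
    "signature_id", "name", "severity",
]

# Hypothetical routing table: (vendor, product) -> pipeline name.
ROUTES = {
    ("ExampleVendor", "ExampleFirewall"): "firewall_traffic",
}


def parse_cef(line):
    """Parse one syslog line carrying a CEF payload into a flat dict.

    Returns None when the line does not contain a complete CEF header.
    """
    start = line.find("CEF:")
    if start == -1:
        return None
    # Split on unescaped pipes: 7 header fields plus the extension string.
    parts = re.split(r"(?<!\\)\|", line[start + 4:], maxsplit=7)
    if len(parts) < 7:
        return None
    record = dict(zip(CEF_HEADER_FIELDS, parts[:7]))
    extension = parts[7] if len(parts) > 7 else ""
    # Simplified key=value parsing: a value runs until the next " key=".
    for match in re.finditer(r"(\w+)=(.*?)(?=\s+\w+=|$)", extension):
        record[match.group(1)] = match.group(2)
    return record


def tag_and_route(line):
    """Identify the log source, tag the record, and pick a pipeline."""
    record = parse_cef(line)
    if record is None:
        return "unparsed", {"raw": line}
    pipeline = ROUTES.get((record["vendor"], record["product"]), "default")
    record["pipeline"] = pipeline  # the tag travels with the record
    return pipeline, record


sample = (
    "<134>Jan 10 12:00:00 fw01 "
    "CEF:0|ExampleVendor|ExampleFirewall|1.0|TRAFFIC|end|3|"
    "src=10.0.0.1 dst=192.168.1.5 spt=51512 dpt=443 act=allow"
)
pipeline, record = tag_and_route(sample)
```

Anything that does not parse goes to an "unparsed" route instead of being dropped, so no log is silently lost before it reaches the data lake.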

Then we move to the “Transform” process. This is critical because it lets us identify the data type so we can format it for use in our data lake and security tools. This part of the process is all about memory and compute. Spending resources here makes data in the cloud much more efficient to search and perform analytics on.

Step 3 deals with the data’s schema, or structure. Giving the data a well-defined structure for use in the cloud reduces compute cost and improves search performance. Here, the system samples the data and infers the schema, so it knows the name of each field. These field names give structure to our data and help the ETL process optimize the data and convert it into Parquet.
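The sampling idea in Step 3 can be sketched in a few lines of Python. This is only an illustration under simple assumptions: every sampled value arrives as a string, only three types are considered, and the function names are ours rather than anything from an ETL product.

```python
# Widening order: when samples disagree, keep the more general type.
_RANK = {"int": 0, "float": 1, "string": 2}


def infer_type(value):
    """Guess the narrowest type that can hold one string value."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def infer_schema(samples):
    """Infer a field-name -> type mapping from a list of sampled records."""
    schema = {}
    for record in samples:
        for field, value in record.items():
            guessed = infer_type(value)
            # First sighting of a field, or a wider type than seen so far.
            if field not in schema or _RANK[guessed] > _RANK[schema[field]]:
                schema[field] = guessed
    return schema


samples = [
    {"src": "10.0.0.1", "dpt": "443", "bytes": "1500"},
    {"src": "10.0.0.2", "dpt": "80", "bytes": "12.5", "act": "allow"},
]
schema = infer_schema(samples)
```

A field that appears in only some samples still ends up in the schema, and a field whose values disagree is widened to the more general type.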

Apache Parquet is a columnar storage format that is designed to optimize the performance of big data processing systems, such as Hadoop and Spark. It is an open-source project developed by the Apache Software Foundation and is designed to efficiently store and process large amounts of structured and semi-structured data. Parquet is optimized for query performance and supports nested data structures, making it well-suited for use with complex data types. It also supports compression and encoding schemes to reduce storage requirements and improve query performance.

Step 4: Now that our ETL tool knows the schema, it converts the original Syslog/CEF message into Parquet. In our experiments, Parquet leads to a 90% reduction in data size, saving ingestion and storage costs in the cloud.

The last part is “Load”, where we forward the data into the cloud. The tool batches the converted data into one-minute files and securely sends them to the cloud storage account API. By batching the data, we gain efficiency in network usage and in reads and writes to cloud storage, thus reducing the cost of data ingestion.
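The one-minute batching can be illustrated with a small Python sketch. It assumes each record carries an epoch-seconds timestamp in a field we call "ts", and the file naming scheme is hypothetical. Writing the Parquet file and uploading it are left out here; in practice each batch would be handed to a Parquet writer (a library such as pyarrow is one option) and then sent to cloud storage.

```python
from collections import defaultdict
from datetime import datetime, timezone


def minute_bucket(epoch_seconds):
    """Round a timestamp down to the start of its minute."""
    return int(epoch_seconds) // 60 * 60


def batch_by_minute(records, ts_field="ts"):
    """Group records into one-minute batches keyed by bucket start time."""
    batches = defaultdict(list)
    for record in records:
        batches[minute_bucket(record[ts_field])].append(record)
    return dict(batches)


def batch_file_name(bucket, prefix="firewall_traffic"):
    """Build a hypothetical object name for one batch, in UTC."""
    stamp = datetime.fromtimestamp(bucket, tz=timezone.utc)
    return f"{prefix}/{stamp.strftime('%Y%m%dT%H%MZ')}.parquet"


records = [
    {"ts": 0, "src": "10.0.0.1"},
    {"ts": 59, "src": "10.0.0.2"},
    {"ts": 60, "src": "10.0.0.3"},
    {"ts": 125, "src": "10.0.0.4"},
]
batches = batch_by_minute(records)
names = sorted(batch_file_name(bucket) for bucket in batches)
```

Each batch becomes one file and one upload, which is where the savings in network round trips and storage operations come from.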

Now that our data is ingested and stored in the cloud, we can perform analytics to extract value from these logs. In our next post, we will focus on how to extract value from the logs that are in the data lake.

Want to find out why and how to build your own security data lake? See the rest of our blog series.
