Introduction:
Currently, I build and manage an industrial network inventory solution:
Data sources => ETL jobs (Python) orchestrated by Apache Airflow => OpenSearch => Dashboard
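For context, the ETL pattern of the inventory pipeline can be sketched as three plain Python functions; in production each one would be wrapped as an Airflow task (e.g. with the TaskFlow `@task` decorator). All names, fields, and placeholder data below are illustrative, not the actual production code.

```python
# Sketch of the extract => transform => load steps behind the inventory pipeline.
# In the real system, Airflow schedules and orchestrates these as DAG tasks.

def extract() -> list[dict]:
    # Placeholder for querying the real data sources.
    return [{"host": "sw-01", "vendor": "acme"}]

def transform(records: list[dict]) -> list[dict]:
    # Normalize fields before indexing into OpenSearch.
    return [{**record, "vendor": record["vendor"].upper()} for record in records]

def load(records: list[dict]) -> None:
    # Placeholder: in production this would bulk-index into OpenSearch,
    # e.g. with opensearch-py's helpers.bulk().
    for record in records:
        print(record["host"], record["vendor"])

load(transform(extract()))
```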
My second job is to implement a log centralization system:
Hosts(log agents) => Ingest pipeline => Kafka => Transform Pipeline => OpenSearch
Logs: system, application, and network.
Expected target: 15,000 hosts.
The ingest and transform pipelines could be built with Logstash or other parsers.
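To make the "Transform Pipeline" stage concrete, here is a hedged sketch of one transform worker in Python, assuming the `kafka-python` and `opensearch-py` client libraries. Topic, index, host, and field names are all assumptions for illustration; the parsing logic is deliberately trivial.

```python
# Sketch of one transform-pipeline worker:
# consume raw log events from Kafka, normalize them, bulk-index into OpenSearch.
import json

def to_document(raw: bytes) -> dict:
    """Parse one JSON log record and map it to an OpenSearch document."""
    event = json.loads(raw)
    return {
        "@timestamp": event["ts"],
        "host": event["host"],
        "message": event["msg"],
        "log_type": event.get("type", "system"),
    }

def run_worker() -> None:
    # Requires: pip install kafka-python opensearch-py
    from kafka import KafkaConsumer
    from opensearchpy import OpenSearch, helpers

    consumer = KafkaConsumer(
        "logs-raw",                       # assumed topic name
        bootstrap_servers="kafka:9092",   # assumed broker address
        group_id="transform-workers",     # scaling = adding consumers to this group
    )
    client = OpenSearch(hosts=["https://opensearch:9200"])
    actions = (
        {"_index": "logs-system", "_source": to_document(message.value)}
        for message in consumer
    )
    helpers.bulk(client, actions)
```

Scaling such a worker is mostly a matter of topic partition count and consumer-group size, which is a different model from Airflow's scheduled-task execution.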
Requirements for the ingest & transform pipeline:
Must:
Manageable and monitorable
Scalable
Resilient
Secure by default
Should:
Reuse the same tech bricks as much as possible to decrease maintenance cost.
First solution: use Apache Airflow to orchestrate and manage the ingest and transform pipelines.
Question:
Can Airflow handle that much log I/O activity?
Can Airflow scale up to the final load (15K hosts)?
Second Solution:
If Airflow is not designed for such a high-throughput I/O pipeline (a log system), I plan to use Kafka Connect instead (a new system block => more maintenance cost).
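For the Kafka Connect alternative, the Kafka => OpenSearch leg would be a sink connector declared as configuration and submitted to the Connect REST API rather than code. The sketch below assumes Aiven's OpenSearch sink connector; the class name, topic, URL, and option names are assumptions to be checked against the connector's documentation.

```json
{
  "name": "logs-opensearch-sink",
  "config": {
    "connector.class": "io.aiven.kafka.connect.opensearch.OpensearchSinkConnector",
    "topics": "logs-raw",
    "connection.url": "https://opensearch:9200",
    "tasks.max": "4",
    "key.ignore": "true",
    "schema.ignore": "true"
  }
}
```

Here `tasks.max` is the scaling knob: Connect distributes tasks across the worker cluster, up to the number of topic partitions.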
Note: I would like to avoid a discussion about maintenance cost, because it depends on the sector, the teams, etc.