Dataxis LakeHouse [CE] Platform
Apache Parquet: Open-source object storage, S3 compatible.
Apache Iceberg: Open-source table format for analytic datasets.
Apache Gravitino: Open-source metadata lake.
Apache Airflow: Open-source workflow management platform for data pipelines.
Apache Kafka: Open-source distributed event streaming platform for data pipelines, streaming analytics and data integration.
Data Build Tool: Open-source tool for data transformation using SQL.
Apache Superset: Open-source modern data exploration and visualisation platform.
Apache Iceberg: Provides expressive SQL commands, ACID transactions, full schema evolution, hidden partitioning, time-travel & rollback and data compaction features directly to data lakes stored in object storage
Apache Gravitino: Data catalog and federated meta-data lake to manage access and perform data governance for all your data sources (including file-stores, relational databases, and event streams) while safely using multiple engines like Spark, Trino, or Flink on multiple formats.
Apache Spark: Unified engine for batch/ streaming data processing. Execute fast, distributed ANSI SQL queries for dash-boarding and ad-hoc reporting. Perform Exploratory Data Analysis (EDA). Train machine learning algorithms.
Spark Cluster: Spark applications run as independent sets of processes on a cluster, coordinated by the main (driver) program. Multiple jobs run concurrently within each application. The driver schedules tasks on the cluster.
Spark Connect: A decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol.