Dataxis LakeHouse [CE]

DataXis LakeHouse [CE] is a self-hosted platform for data engineers and data scientists. It is built on open-source technologies and deployed using containers.

Dataxis LakeHouse [CE] Platform

Components:

Apache Parquet: Open-source object storage, S3 compatible.

Apache Iceberg: Open-source table format for analytic datasets.

Apache Gravitino: Open-source metadata lake.

Integrations:

Apache Airflow: Open-source workflow management platform for data pipelines.

Apache Kafka: Open-source distributed event streaming platform for data pipelines, streaming analytics and data integration.

Data Build Tool: Open-source tool for data transformation using SQL.

Apache Superset: Open-source modern data exploration and visualisation platform.

Features:

Apache Iceberg: Provides expressive SQL commands, ACID transactions, full schema evolution, hidden partitioning, time-travel & rollback and data compaction features directly to data lakes stored in object storage

Apache Gravitino: Data catalog and federated meta-data lake to manage access and perform data governance for all your data sources (including file-stores, relational databases, and event streams) while safely using multiple engines like Spark, Trino, or Flink on multiple formats.

Apache Spark: Unified engine for batch/ streaming data processing. Execute fast, distributed ANSI SQL queries for dash-boarding and ad-hoc reporting. Perform Exploratory Data Analysis (EDA). Train machine learning algorithms.

Spark Cluster: Spark applications run as independent sets of processes on a cluster, coordinated by the main (driver) program. Multiple jobs run concurrently within each application. The driver schedules tasks on the cluster.

Spark Connect: A decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol.