Top 5 Key Features of Apache Iceberg for Modern Data Lakes

Big data has evolved significantly since its inception in the late 2000s. Many organizations quickly adopted the trend and built their big data platforms using open-source tools like Apache Hadoop. Over time, these companies ran into trouble managing rapidly evolving data processing needs: handling schema-level changes, evolving partition schemes, and going back in time to inspect historical data. 

I faced similar challenges while designing large-scale distributed systems back in the 2010s for a big tech company and a healthcare customer. Some industries need these capabilities to comply with banking, finance, and healthcare regulations. Heavily data-driven companies like Netflix faced these challenges as well. Netflix created a table format called “Iceberg,” which sits on top of existing data files and delivers key features through its metadata architecture. It quickly became a top-level ASF project as it gained rapid interest in the data community. In this article, I will explore the top 5 key features of Apache Iceberg with examples and diagrams. 

1. Time Travel

Figure 1: Time travel in Apache Iceberg table format (image created by author)

This feature allows you to query your data as it existed at any point in time. It opens up new possibilities for data and business analysts to understand trends and how the data evolved over time. You can effortlessly roll back to a previous state in case of errors. This feature also facilitates audits by allowing you to analyze the data as of a specific point in time.

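As a sketch, here is how time travel looks in Spark SQL on an Iceberg table (the catalog, table, and snapshot ID below are illustrative):

```sql
-- Query the table as it existed at a given timestamp
SELECT * FROM demo.db.orders TIMESTAMP AS OF '2024-01-15 10:00:00';

-- Query a specific snapshot by its ID
SELECT * FROM demo.db.orders VERSION AS OF 4936667031969971733;

-- Roll back the table to an earlier snapshot
CALL demo.system.rollback_to_snapshot('db.orders', 4936667031969971733);
```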

2. Schema Evolution

Apache Iceberg’s schema evolution allows changes to your schema without any huge effort or costly migrations. As your business needs evolve, you can:

  • Add and remove columns without any downtime or table rewrites. 
  • Widen a column’s type (for example, int to bigint).
  • Change the order of columns.
  • Rename an existing column.

These changes are handled at the metadata level without needing to rewrite the underlying data. 

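A minimal Spark SQL sketch of these metadata-only changes (table and column names are illustrative):

```sql
-- Add and remove columns
ALTER TABLE demo.db.customers ADD COLUMN middle_name string;
ALTER TABLE demo.db.customers DROP COLUMN fax_number;

-- Widen a column's type (int -> bigint)
ALTER TABLE demo.db.customers ALTER COLUMN order_count TYPE bigint;

-- Reorder a column
ALTER TABLE demo.db.customers ALTER COLUMN middle_name AFTER first_name;

-- Rename a column
ALTER TABLE demo.db.customers RENAME COLUMN zip TO postal_code;
```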

3. Partition Evolution

Using the Apache Iceberg table format, you can change a table's partitioning strategy without rewriting the underlying table or migrating the data to a new table. This is possible because queries do not reference partition values directly, as they do in Apache Hadoop; Iceberg keeps metadata for each partition version separately. This makes it easy to derive the splits while querying the data. For example, when querying a table over a date range, data written while the table was partitioned by month (before the change) is planned as one split, and data written after the table switched to daily partitioning is planned as another. This is called split planning.

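For example, switching a table from monthly to daily partitioning might look like this in Spark SQL (names are illustrative). Existing data keeps its old layout; only newly written data uses the new spec:

```sql
-- Originally partitioned by month
CREATE TABLE demo.db.events (
  id       bigint,
  event_ts timestamp,
  payload  string)
USING iceberg
PARTITIONED BY (months(event_ts));

-- Evolve to daily partitioning without rewriting existing data
ALTER TABLE demo.db.events ADD PARTITION FIELD days(event_ts);
ALTER TABLE demo.db.events DROP PARTITION FIELD months(event_ts);
```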

4. ACID Transactions 

Iceberg provides robust support for transactions with Atomicity, Consistency, Isolation, and Durability (ACID) guarantees. It allows multiple concurrent write operations, which enables high throughput in data-intensive jobs without compromising data consistency.

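For instance, two jobs can run row-level DML like the MERGE below against the same table at the same time; Iceberg's optimistic concurrency control ensures each commit succeeds or retries atomically (table names are illustrative):

```sql
MERGE INTO demo.db.accounts t
USING updates s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.balance = s.balance
WHEN NOT MATCHED THEN INSERT *;
```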

All operations in Iceberg are transactional, meaning the data remains consistent despite failures or concurrent modifications.

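As a sketch, each statement below commits atomically: readers see either all of its changes or none of them (table name is illustrative):

```sql
UPDATE demo.db.accounts SET balance = balance - 100 WHERE id = 42;
DELETE FROM demo.db.accounts WHERE status = 'closed';
```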

It also supports different isolation levels, which lets you balance performance and consistency based on your requirements.

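In Spark SQL, the isolation level for row-level operations can be configured per table through table properties; serializable is the default, while snapshot isolation trades some strictness for fewer commit conflicts (table name is illustrative):

```sql
ALTER TABLE demo.db.accounts SET TBLPROPERTIES (
  'write.delete.isolation-level' = 'snapshot',
  'write.merge.isolation-level'  = 'serializable'
);
```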

Here is a summary showing how Iceberg handles row-level updates and deletes. 

Figure 2: Delete records process in Apache Iceberg (image created by author)

5. Advanced Table Operations

Iceberg supports advanced table operations such as:

  • Creating and managing table snapshots, which enables robust version control.
  • Fast query planning and execution backed by its highly optimized metadata.
  • Built-in tools for table maintenance, such as compaction and orphan file cleanup.

Iceberg is designed to work with all major cloud object stores, such as Amazon S3, GCS, and Azure Blob Storage. It also integrates easily with data processing engines such as Spark, Presto, Trino, and Hive.

Final Thoughts

These highlighted features allow companies to build modern, flexible, scalable, and efficient data lakes that support time travel, painless schema and partition evolution, and ACID transactions.

Source:
https://dzone.com/articles/key-features-of-apache-iceberg-for-data-lakes