Hadoop
-
Migrating Hadoop to the Cloud: 2X Storage Capacity and Lower Ops Costs
Yimian is a leading AI-powered data analytics provider specializing in digital commerce data. We offer real-time insights on business strategy, product development, and digital commerce operations. Many of our customers are industry leaders in personal care, makeup, F&B, pet, and auto, such as Procter & Gamble, Unilever, and Mars. Our original technology architecture was a big data cluster built with CDH (Cloudera Distribution of Hadoop) in an on-premises data center. As our business grew, the data volume increased dramatically. To address challenges…
-
How To Use Change Data Capture With Apache Kafka and ScyllaDB
In this hands-on lab from ScyllaDB University, you will learn how to use the ScyllaDB CDC source connector to push row-level change events from the tables of a ScyllaDB cluster to a Kafka server. What Is ScyllaDB CDC? To recap, Change Data Capture (CDC) is a feature that allows you to query not only the current state of a database table but also the history of all changes made to it. CDC is production-ready (GA) starting from…
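As a taste of the mechanism the lab builds on, here is a minimal CQL sketch: enabling CDC on a table and reading its auto-generated log table. The keyspace, table, and sample values are hypothetical; the `cdc$`-prefixed metadata columns are part of ScyllaDB's CDC log schema.

```sql
CREATE KEYSPACE ks WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Create a table with CDC enabled; ScyllaDB then maintains a
-- companion log table named ks.orders_scylla_cdc_log.
CREATE TABLE ks.orders (
    id int PRIMARY KEY,
    amount int
) WITH cdc = {'enabled': true};

-- Every INSERT/UPDATE/DELETE is recorded as a row in the log table.
INSERT INTO ks.orders (id, amount) VALUES (1, 42);

-- Query the history of changes, not just the current state.
SELECT "cdc$stream_id", "cdc$time", "cdc$operation", id, amount
FROM ks.orders_scylla_cdc_log;
```

The CDC source connector covered in the lab consumes this log and publishes each change event to a Kafka topic.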
-
From Hadoop to Cloud: Why and How to Decouple Storage and Compute in Big Data Platforms
The advent of the Apache Hadoop Distributed File System (HDFS) revolutionized the storage, processing, and analysis of data for enterprises, accelerating the growth of big data and bringing transformative changes to the industry. Initially, Hadoop integrated storage and compute, but the emergence of cloud computing led to a separation of these components. Object storage emerged as an alternative to HDFS but had limitations. To address these limitations, JuiceFS, an open-source, high-performance distributed file system, offers cost-effective solutions for data-intensive scenarios…
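As a rough illustration of the decoupled layout JuiceFS enables, the sketch below formats a volume whose file data lives in an S3 bucket and whose metadata lives in Redis, then mounts it as a POSIX file system. The bucket name, credentials, and Redis address are placeholders.

```shell
# Create a JuiceFS volume: data goes to the S3 bucket,
# metadata goes to Redis (both addresses are placeholders).
juicefs format \
    --storage s3 \
    --bucket https://my-bucket.s3.us-east-1.amazonaws.com \
    --access-key "$AWS_ACCESS_KEY_ID" \
    --secret-key "$AWS_SECRET_ACCESS_KEY" \
    redis://127.0.0.1:6379/1 myjfs

# Mount the volume; compute engines read and write /mnt/jfs
# while the actual bytes are stored in object storage.
juicefs mount -d redis://127.0.0.1:6379/1 /mnt/jfs
```

Because compute nodes only mount the volume, they can be scaled or replaced independently of where the data physically sits.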
-
Building a Data Warehouse for Traditional Industry
This is part of the digital transformation of a real estate giant. For the sake of confidentiality, I'm not going to reveal any business data, but you'll get a detailed view of our data warehouse and our optimization strategies. Now let's get started. Architecture Logically, our data architecture can be divided into four parts. Data integration: This is supported by Flink CDC, DataX, and the Multi-Catalog feature of Apache Doris. Data management: We use Apache DolphinScheduler for script lifecycle…
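For a sense of how the Multi-Catalog feature mentioned above works, here is a minimal sketch: registering an external Hive Metastore catalog in Apache Doris and querying it alongside internal tables. The catalog name, databases, tables, and metastore URI are all hypothetical.

```sql
-- Register an external catalog backed by a Hive Metastore
-- (the thrift address is a placeholder).
CREATE CATALOG hive_prod PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://hms-host:9083"
);

-- External tables become queryable with three-part names,
-- so lake data can be joined with Doris's internal tables.
SELECT o.order_id, d.region
FROM hive_prod.sales.orders AS o
JOIN internal.dim.dim_region AS d ON o.region_id = d.region_id;
```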
-
Get Started With Trino and Alluxio in Five Minutes
Trino is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Trino was designed to handle data warehousing, ETL, and interactive analytics over large amounts of data and to produce reports. Alluxio is an open-source data orchestration platform for large-scale analytics and AI. Alluxio sits between compute frameworks such as Trino and Apache Spark and various storage systems like Amazon S3, Google Cloud Storage, HDFS, and MinIO. This is a…
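To hint at how the two connect, here is a sketch under common assumptions: a Trino Hive catalog whose table locations point into Alluxio, so reads go through Alluxio's cache rather than hitting the underlying store directly. Hostnames, ports, and table details are placeholders, and the Alluxio client is assumed to be on Trino's classpath.

```sql
-- etc/catalog/hive.properties on the coordinator and workers:
--   connector.name=hive
--   hive.metastore.uri=thrift://hive-metastore:9083
--
-- A table can then be declared over an alluxio:// location:
CREATE TABLE hive.default.events (
    user_id BIGINT,
    action  VARCHAR
)
WITH (
    external_location = 'alluxio://alluxio-master:19998/events',
    format = 'PARQUET'
);

-- Queries now read through Alluxio, which caches hot data
-- close to the Trino workers.
SELECT action, count(*) FROM hive.default.events GROUP BY action;
```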
-
Stateful Stream Processing With Memphis and Apache Spark
Amazon Simple Storage Service (S3) is a highly scalable, durable, and secure object storage service offered by Amazon Web Services (AWS). S3 allows businesses to store and retrieve any amount of data from anywhere on the web through its enterprise-grade services. S3 is designed to be highly interoperable and integrates seamlessly with other AWS and third-party tools and technologies to process data stored in Amazon S3. One of these is Amazon EMR (Elastic MapReduce)…
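As a minimal sketch of the Spark-on-S3 pattern this article builds on, the PySpark snippet below reads JSON records from an S3 bucket via the s3a connector, runs a simple aggregation, and writes the result back. The bucket, prefix, and schema are hypothetical; credentials are assumed to come from the environment or an instance role, and hadoop-aws must be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Credentials are picked up from the environment, an instance
# profile, or explicit fs.s3a.* configuration.
spark = (
    SparkSession.builder
    .appName("s3-aggregation-sketch")
    .getOrCreate()
)

# Read raw events from S3 (bucket and prefix are placeholders).
events = spark.read.json("s3a://my-bucket/events/2023/")

# Count events per type and write the summary back to S3.
summary = events.groupBy("event_type").agg(F.count("*").alias("n"))
summary.write.mode("overwrite").parquet("s3a://my-bucket/summaries/")

spark.stop()
```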