Hadoop
-
Migrating Hadoop to the Cloud: 2X Storage Capacity and Lower Ops Costs
Yimian is a leading AI-powered data analytics provider specializing in digital commerce data. We offer real-time insights on business strategy, product development, and digital commerce operations. Many of our customers are industry leaders in personal care, makeup, F&B, pet, and auto, such as Procter & Gamble, Unilever, and Mars. Our original technology architecture was a big data cluster built with CDH (Cloudera Distribution of Hadoop) in an on-premises data center. As our business grew, the data volume increased dramatically. To address challenges…
-
How To Use Change Data Capture With Apache Kafka and ScyllaDB
In this hands-on lab from ScyllaDB University, you will learn how to use the ScyllaDB CDC source connector to push row-level change events from the tables of a ScyllaDB cluster to a Kafka server. What Is ScyllaDB CDC? To recap, Change Data Capture (CDC) is a feature that allows you to query not only the current state of a database table but also the history of all changes made to it. CDC is production-ready (GA) starting from…
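As a taste of the mechanism the lab builds on, here is a minimal CQL sketch: enabling CDC on a table and reading its auto-generated log table. The keyspace, table, and sample values are hypothetical; the `cdc$`-prefixed metadata columns are part of ScyllaDB's CDC log schema.

```sql
CREATE KEYSPACE ks WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Create a table with CDC enabled; ScyllaDB then maintains a
-- companion log table named ks.orders_scylla_cdc_log.
CREATE TABLE ks.orders (
    id int PRIMARY KEY,
    amount int
) WITH cdc = {'enabled': true};

-- Every INSERT/UPDATE/DELETE is recorded as a row in the log table.
INSERT INTO ks.orders (id, amount) VALUES (1, 42);

-- Query the history of changes, not just the current state.
SELECT "cdc$stream_id", "cdc$time", "cdc$operation", id, amount
FROM ks.orders_scylla_cdc_log;
```

The CDC source connector covered in the lab consumes this log and publishes each change event to a Kafka topic.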
-
From Hadoop to Cloud: Why and How to Decouple Storage and Compute in Big Data Platforms
The advent of the Apache Hadoop Distributed File System (HDFS) revolutionized the storage, processing, and analysis of data for enterprises, accelerating the growth of big data and bringing transformative changes to the industry. Initially, Hadoop integrated storage and compute, but the emergence of cloud computing led to a separation of these components. Object storage emerged as an alternative to HDFS but had limitations. To address these limitations, JuiceFS, an open-source, high-performance distributed file system, offers cost-effective solutions for data-intensive scenarios…
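As a rough illustration of the decoupled layout JuiceFS enables, the sketch below formats a volume whose file data lives in an S3 bucket and whose metadata lives in Redis, then mounts it as a POSIX file system. The bucket name, credentials, and Redis address are placeholders.

```shell
# Create a JuiceFS volume: data goes to the S3 bucket,
# metadata goes to Redis (both addresses are placeholders).
juicefs format \
    --storage s3 \
    --bucket https://my-bucket.s3.us-east-1.amazonaws.com \
    --access-key "$AWS_ACCESS_KEY_ID" \
    --secret-key "$AWS_SECRET_ACCESS_KEY" \
    redis://127.0.0.1:6379/1 myjfs

# Mount the volume; compute engines read and write /mnt/jfs
# while the actual bytes are stored in object storage.
juicefs mount -d redis://127.0.0.1:6379/1 /mnt/jfs
```

Because compute nodes only mount the volume, they can be scaled or replaced independently of where the data physically sits.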
-
Building a Data Warehouse for Traditional Industry
This is part of the digital transformation of a real estate giant. For the sake of confidentiality, I'm not going to reveal any business data, but you'll get a detailed view of our data warehouse and our optimization strategies. Now let's get started. Architecture Logically, our data architecture can be divided into four parts. Data integration: This is supported by Flink CDC, DataX, and the Multi-Catalog feature of Apache Doris. Data management: We use Apache DolphinScheduler for script lifecycle…
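For a sense of how the Multi-Catalog feature mentioned above works, here is a minimal sketch: registering an external Hive Metastore catalog in Apache Doris and querying it alongside internal tables. The catalog name, databases, tables, and metastore URI are all hypothetical.

```sql
-- Register an external catalog backed by a Hive Metastore
-- (the thrift address is a placeholder).
CREATE CATALOG hive_prod PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://hms-host:9083"
);

-- External tables become queryable with three-part names,
-- so lake data can be joined with Doris's internal tables.
SELECT o.order_id, d.region
FROM hive_prod.sales.orders AS o
JOIN internal.dim.dim_region AS d ON o.region_id = d.region_id;
```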
-
Get Started With Trino and Alluxio in Five Minutes
Trino is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Trino was designed to handle data warehousing, ETL, and interactive analytics over large amounts of data and to produce reports. Alluxio is an open-source data orchestration platform for large-scale analytics and AI. Alluxio sits between compute frameworks such as Trino and Apache Spark and various storage systems like Amazon S3, Google Cloud Storage, HDFS, and MinIO. This is a…
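To hint at how the two connect, here is a sketch under common assumptions: a Trino Hive catalog whose table locations point into Alluxio, so reads go through Alluxio's cache rather than hitting the underlying store directly. Hostnames, ports, and table details are placeholders, and the Alluxio client is assumed to be on Trino's classpath.

```sql
-- etc/catalog/hive.properties on the coordinator and workers:
--   connector.name=hive
--   hive.metastore.uri=thrift://hive-metastore:9083
--
-- A table can then be declared over an alluxio:// location:
CREATE TABLE hive.default.events (
    user_id BIGINT,
    action  VARCHAR
)
WITH (
    external_location = 'alluxio://alluxio-master:19998/events',
    format = 'PARQUET'
);

-- Queries now read through Alluxio, which caches hot data
-- close to the Trino workers.
SELECT action, count(*) FROM hive.default.events GROUP BY action;
```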
-
Stateful Stream Processing With Memphis and Apache Spark
Amazon Simple Storage Service (S3) is a highly scalable, durable, and secure object storage service offered by Amazon Web Services (AWS). S3 allows businesses to store and retrieve any amount of data from anywhere on the web through its enterprise-grade services. S3 is designed to be highly interoperable and integrates seamlessly with other AWS and third-party tools and technologies to process data stored in Amazon S3. One of these is Amazon EMR (Elastic MapReduce)…
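As a minimal sketch of the Spark-on-S3 pattern this article builds on, the PySpark snippet below reads JSON records from an S3 bucket via the s3a connector, runs a simple aggregation, and writes the result back. The bucket, prefix, and schema are hypothetical; credentials are assumed to come from the environment or an instance role, and hadoop-aws must be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Credentials are picked up from the environment, an instance
# profile, or explicit fs.s3a.* configuration.
spark = (
    SparkSession.builder
    .appName("s3-aggregation-sketch")
    .getOrCreate()
)

# Read raw events from S3 (bucket and prefix are placeholders).
events = spark.read.json("s3a://my-bucket/events/2023/")

# Count events per type and write the summary back to S3.
summary = events.groupBy("event_type").agg(F.count("*").alias("n"))
summary.write.mode("overwrite").parquet("s3a://my-bucket/summaries/")

spark.stop()
```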