CDH to Azure HDInsight: Migration

Step-by-Step Guide to Migrating a CDH Cluster to Azure HDInsight.

Posted by Aravind Nuthalapati on April 11, 2020

This article explains how to migrate a Cloudera CDH cluster to Azure HDInsight.

Best Practices for Migrating CDH to Azure HDInsight

1. Assess the Existing CDH Cluster

Before migration, perform a detailed assessment of your CDH cluster:

  • Identify services in use (HDFS, Hive, Spark, HBase, Kafka, etc.).
  • Check cluster size, resource consumption, and data storage patterns.
  • Analyze job dependencies, data lineage, and security policies.
  • Evaluate third-party integrations and custom configurations.
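
As a starting point for the sizing check above, here is a minimal sketch that totals directory usage from `hdfs dfs -du` output. It assumes the three-column format (size, disk space consumed, path) printed by recent Hadoop releases; `summarize_du` is a hypothetical helper name, and sample lines stand in for a live cluster.

```shell
# Sum logical and replicated bytes from `hdfs dfs -du <path>` output.
# Assumes three whitespace-separated columns: size, disk space consumed, path.
summarize_du() {
  awk '{ logical += $1; raw += $2; n++ }
       END { printf "dirs=%d logical_bytes=%.0f raw_bytes=%.0f\n", n, logical, raw }'
}

# Sample input standing in for: hdfs dfs -du /
printf '%s\n' \
  '1024 3072 /user' \
  '2048 6144 /warehouse' | summarize_du
```

On a real cluster the same function can be fed directly, for example `hdfs dfs -du / | summarize_du`. The ratio of raw to logical bytes reflects the HDFS replication factor, which does not carry over to Azure Storage, so the logical total is the figure to use when estimating storage needs.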

2. Choose the Right Azure HDInsight Cluster Type

HDInsight offers several cluster types. Choose the one that best fits each workload:

  • Apache Spark: For big data analytics workloads.
  • Apache Hadoop: For batch processing with Hive & MapReduce.
  • Apache HBase: For NoSQL real-time workloads.
  • Apache Kafka: For real-time data streaming.
  • Interactive Query (LLAP): For fast SQL analytics on Hive.

3. Plan HDFS to Azure Storage Migration

Migrate data from HDFS to Azure Data Lake Storage (ADLS) or Azure Blob Storage. For small datasets, you can stage the data on local disk and upload it with the Azure CLI:

hdfs dfs -copyToLocal /hdfs-data /local-data
az storage blob upload-batch -d my-container --account-name mystorageaccount -s /local-data

Best Practices:

  • Use DistCp for large-scale HDFS data migration.
  • Leverage Azure Data Box for petabyte-scale data transfers.
  • Enable the hierarchical namespace on the ADLS Gen2 storage account so that directory renames and deletes are atomic, which speeds up Hive and Spark job commits.
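
For the DistCp recommendation, a minimal sketch of a direct HDFS-to-ADLS Gen2 copy could look like the following. It reuses the container and account names from the example above, assumes the CDH cluster already has the ABFS driver and storage credentials configured, and only prints the command rather than running it. `abfs_uri` and `distcp_cmd` are hypothetical helper names.

```shell
# Build an ABFS URI for an ADLS Gen2 container (hypothetical helper).
abfs_uri() {
  # $1 = container, $2 = storage account, $3 = absolute path
  printf 'abfs://%s@%s.dfs.core.windows.net%s\n' "$1" "$2" "$3"
}

# Print (not run) a DistCp command. -update copies only missing or changed
# files, and -m caps the number of parallel map tasks.
distcp_cmd() {
  # $1 = source HDFS path, $2 = destination URI, $3 = max mappers
  printf 'hadoop distcp -update -m %s %s %s\n' "$3" "$1" "$2"
}

dest=$(abfs_uri my-container mystorageaccount /hdfs-data)
distcp_cmd hdfs:///hdfs-data "$dest" 20
```

Because `-update` skips files that already exist unchanged at the destination, the same command can be rerun for an incremental pass shortly before cutover. The mapper count of 20 is an arbitrary illustration; tune it to the available network bandwidth.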

4. Migrate Hive Metastore

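HDInsight can use an external (custom) Hive Metastore, which lets metadata outlive any single cluster. The steps below assume a MySQL-compatible target; HDInsight's documented custom metastore backend is Azure SQL Database, so confirm which databases your HDInsight version supports before importing.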

  • Export CDH Hive Metastore:
mysqldump -u root -p --databases metastore > metastore_backup.sql
  • Import into Azure Database for MySQL:
mysql -h azure-mysql-server -u admin -p < metastore_backup.sql
  • Configure HDInsight to use the external Metastore (azure-metastore is a placeholder for the host running the metastore service):
hive.metastore.uris=thrift://azure-metastore:9083
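
After the import, table locations stored in the metastore still point at the CDH NameNode. Hive ships a `metatool` service that can rewrite the filesystem root. The sketch below only prints the commands; `metatool_cmds` and the `cdh-namenode` host are hypothetical, and the storage names are the ones used earlier in this article.

```shell
# Print the metatool commands that would rewrite table locations in the
# migrated metastore. Run the printed commands on a node whose Hive is
# configured against that metastore. -dryRun previews changes without
# persisting them.
metatool_cmds() {
  # $1 = old filesystem root (CDH), $2 = new filesystem root (Azure)
  printf 'hive --service metatool -listFSRoot\n'
  printf 'hive --service metatool -updateLocation %s %s -dryRun\n' "$2" "$1"
  printf 'hive --service metatool -updateLocation %s %s\n' "$2" "$1"
}

metatool_cmds hdfs://cdh-namenode:8020 \
  abfs://my-container@mystorageaccount.dfs.core.windows.net
```

Note the argument order: `-updateLocation` takes the new location first and the old location second. Reviewing the dry-run output before the final command is a cheap safeguard against rewriting the wrong root.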

5. Reconfigure and Migrate Hive Queries

Adjust Hive queries for HDInsight compatibility:

  • Convert Hive-managed tables to external tables pointing to ADLS/Blob Storage.
  • Update queries to use optimized formats like ORC and Parquet.
  • Test queries with HDInsight Interactive Query (LLAP) for better performance.
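
The managed-to-external conversion can be scripted. This is a minimal sketch that emits the HiveQL for one table; `external_table_ddl` is a hypothetical helper, and `sales.orders` and the warehouse path are made-up examples.

```shell
# Emit HiveQL that marks a table as external and repoints it at Azure storage.
external_table_ddl() {
  # $1 = database.table, $2 = new location URI
  printf "ALTER TABLE %s SET TBLPROPERTIES ('EXTERNAL'='TRUE');\n" "$1"
  printf "ALTER TABLE %s SET LOCATION '%s';\n" "$1" "$2"
}

external_table_ddl sales.orders \
  abfs://my-container@mystorageaccount.dfs.core.windows.net/warehouse/sales.db/orders
```

The output can be saved to a file and run with your usual Hive client. For partitioned tables, `SET LOCATION` on the table does not change the location of existing partitions, so those need to be updated separately.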

6. Migrate HBase to Azure HDInsight HBase

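Take a snapshot of each table from the HBase shell on the CDH cluster (for example, snapshot 'myTable', 'myTableSnapshot'), then export it:

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot myTableSnapshot -copy-to hdfs:///hbase-backup

Copy the exported snapshot into the HBase root directory of the HDInsight cluster, then restore it from the HBase shell there:

echo "restore_snapshot 'myTableSnapshot'" | hbase shell -n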
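
Snapshot export works on named snapshots, so every table needs one before it can be exported. The sketch below prints the HBase shell statements for a list of tables; `snapshot_cmds` and the `migration` suffix are hypothetical choices.

```shell
# Print HBase shell statements that snapshot each listed table.
snapshot_cmds() {
  # $1 = suffix for snapshot names; table names are read from stdin, one per line
  while IFS= read -r table; do
    [ -n "$table" ] || continue
    printf "snapshot '%s', '%s_%s'\n" "$table" "$table" "$1"
  done
}

printf '%s\n' myTable otherTable | snapshot_cmds migration
```

The printed statements can be piped into `hbase shell -n` on the CDH cluster. Tables in a non-default namespace (written as `ns:table`) would need a different naming rule, because snapshot names cannot contain a colon.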

7. Migrate Apache Kafka

Use Kafka MirrorMaker for cross-cluster replication:

bin/kafka-mirror-maker.sh --consumer.config source-cluster.properties \
 --producer.config target-cluster.properties --whitelist ".*"
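  • Create topics on HDInsight Kafka with the same partition counts as on CDH Kafka before mirroring; MirrorMaker copies messages but does not carry over consumer offsets.
  • Verify that Kafka security settings (SASL, SSL) are present in both client property files so that MirrorMaker can authenticate to each cluster.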
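
The two property files passed to MirrorMaker are ordinary Kafka client configs. Here is a minimal sketch that writes them, assuming a Kafka version whose MirrorMaker uses the Java consumer (so `bootstrap.servers` rather than a ZooKeeper address). The broker hostnames and the group id are placeholders, and `write_mm_configs` is a hypothetical helper.

```shell
# Write minimal MirrorMaker client configs (hostnames are placeholders).
write_mm_configs() {
  # $1 = source (CDH) brokers, $2 = target (HDInsight) brokers, $3 = output dir
  cat > "$3/source-cluster.properties" <<EOF
bootstrap.servers=$1
group.id=cdh-to-hdinsight-mirror
auto.offset.reset=earliest
EOF
  cat > "$3/target-cluster.properties" <<EOF
bootstrap.servers=$2
acks=all
EOF
}

outdir=$(mktemp -d)
write_mm_configs cdh-broker1:9092 hdi-broker1:9092 "$outdir"
cat "$outdir/source-cluster.properties" "$outdir/target-cluster.properties"
```

`auto.offset.reset=earliest` makes a new mirror group start from the oldest retained message, and `acks=all` trades some throughput for durability on the target. Security settings such as `security.protocol` would be added to the same files when SASL or SSL is in use.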

8. Migrate Apache Spark Jobs

Convert CDH Spark jobs to run on HDInsight Spark:

  • Replace HDFS paths with ADLS paths in job configurations.
  • Adjust spark-defaults.conf settings (executor memory, cores, and instance counts) to match the HDInsight node sizes.
  • Optimize execution by leveraging Azure Synapse for Spark SQL workloads.
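
Replacing HDFS paths is mostly a mechanical search and replace. A minimal sketch, assuming the jobs reference a single HDFS root (`hdfs://nameservice1` here is a made-up example) and that `rewrite_hdfs_paths` is a hypothetical helper:

```shell
# Rewrite hdfs:// URIs to an ADLS Gen2 root (reads stdin, writes stdout).
rewrite_hdfs_paths() {
  # $1 = HDFS root to replace, $2 = ABFS root to substitute
  sed "s|$1|$2|g"
}

# Sample lines standing in for a spark-defaults.conf file.
printf '%s\n' \
  'spark.eventLog.dir hdfs://nameservice1/spark/logs' \
  'spark.executor.memory 4g' |
  rewrite_hdfs_paths hdfs://nameservice1 \
    abfs://my-container@mystorageaccount.dfs.core.windows.net
```

Lines without an HDFS URI pass through unchanged. The same filter can be applied to job property files or submit scripts, but review the result by hand, since relative paths and paths built at runtime will not be caught.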

9. Implement Security & Access Control

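  • Enable Azure AD authentication for HDInsight through the Enterprise Security Package (ESP).
  • Use Apache Ranger for fine-grained data access policies and Azure Role-Based Access Control (RBAC) for managing the Azure resources themselves.
  • Encrypt data at rest using Azure Storage encryption, which is enabled by default; use customer-managed keys if your policy requires them.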

10. Validate Migration and Optimize Performance

  • Run test workloads to compare performance between CDH and HDInsight.
  • Enable autoscaling for optimized cluster resource allocation.
  • Monitor performance using Azure Monitor and HDInsight Metrics.
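
Before comparing performance, it is worth confirming that the data itself arrived intact. One simple approach, sketched below under the assumption that you can produce a manifest of "<path> <size in bytes>" lines on each side, is to sort and diff the two manifests. `compare_manifests` is a hypothetical helper, and the sample files stand in for real listings.

```shell
# Compare two manifests of "<relative path> <size in bytes>" lines.
# Prints differing entries; returns 0 only when the manifests match.
compare_manifests() {
  # $1 = source manifest file, $2 = target manifest file
  src_sorted=$(mktemp) && dst_sorted=$(mktemp) || return 2
  sort "$1" > "$src_sorted"
  sort "$2" > "$dst_sorted"
  diff "$src_sorted" "$dst_sorted"
  status=$?
  rm -f "$src_sorted" "$dst_sorted"
  return "$status"
}

# Sample manifests: same entries, different order.
src=$(mktemp); dst=$(mktemp)
printf '%s\n' '/data/a.orc 100' '/data/b.orc 200' > "$src"
printf '%s\n' '/data/b.orc 200' '/data/a.orc 100' > "$dst"
if compare_manifests "$src" "$dst"; then echo "manifests match"; fi
rm -f "$src" "$dst"
```

A manifest can be derived from a recursive listing such as `hdfs dfs -ls -R`, keeping the size and path columns for non-directory entries. Matching sizes do not prove identical content, so treat this as a first-pass check and add row counts or checksums for critical tables.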

Summary

Migrating from CDH to Azure HDInsight involves careful planning, data migration, security configuration, and performance tuning. Following these best practices reduces the risk of data loss and downtime during the transition while taking advantage of Azure’s scalability and managed services.