In a recent episode of OnPrem Pros by xByte Technologies, Nicholas Coker and Justin Payne from Smart Communications discuss why they shifted from AWS to a self-hosted infrastructure built on Proxmox and Ceph. The move was driven by rising storage costs and a desire for greater control over their systems.
The Catalyst for Change
Smart Communications watched their AWS S3 storage bill climb from $3,000 to $30,000 per month, a tenfold increase. That surge prompted the team to look for alternatives offering lower costs and more control. They settled on Proxmox for virtualization and Ceph for storage, bringing their systems in-house and leaning on open-source technologies to manage costs and keep control of the stack.
Switching to Proxmox and Ceph
While open-source solutions often face skepticism regarding reliability and support, Smart Communications found Proxmox straightforward to deploy and manage, even across an 18-node cluster. The flexibility and customization offered by open-source platforms like Proxmox and Ceph gave them a level of control over their infrastructure they did not have on AWS. Integrating Ceph, however, introduced complexities of its own, particularly around networking and fiber configurations.
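For context, standing up a Proxmox cluster of that size comes down to a handful of CLI steps. The commands below are a minimal sketch assuming a hypothetical cluster name and a management IP on the copper network; they are generic Proxmox commands, not configuration taken from the episode.

```
# On the first node: create the cluster (cluster name is a placeholder)
pvecm create smartcomms-cluster

# On each additional node: join using the first node's management IP
# (10.0.10.11 is an assumed address on the copper/management network)
pvecm add 10.0.10.11

# Verify quorum and membership once all 18 nodes have joined
pvecm status
pvecm nodes
```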
Navigating Network Complexities
Each node in their setup had six connections: four fiber links dedicated to Ceph and two copper links for Proxmox. The sensitivity of fiber optics to bending and contamination led to persistent connection issues. Additionally, Ceph's requirement for meticulous planning regarding redundancy and monitor distribution added layers of complexity, especially in an 18-node environment.
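As an illustration of how a six-link layout like that might be expressed on a single Proxmox node, the excerpt below bonds a pair of fiber links for Ceph traffic and keeps management and VM traffic on a copper-backed bridge. Interface names, bond mode, MTU, and subnets are assumptions for the sketch, not the team's actual configuration.

```
# /etc/network/interfaces (excerpt) -- illustrative only
auto bond0
iface bond0 inet static
    address 10.10.20.11/24        # assumed Ceph network on fiber
    bond-slaves ens2f0 ens2f1     # two of the four fiber links
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000

auto vmbr0
iface vmbr0 inet static
    address 10.0.10.11/24         # assumed management subnet on copper
    gateway 10.0.10.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
```

Separating Ceph traffic from management traffic this way is what makes the extra links worthwhile, but it also multiplies the number of fiber runs that can fail.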
Lessons from the Trenches
Reflecting on their experience, the team acknowledged areas for improvement:
- Hardware Choices: They would opt for newer servers, such as Dell R640s, instead of a mix of older PowerEdge models.
- Fiber Usage: The overreliance on fiber optics introduced avoidable challenges.
- Disk I/O Bottlenecks: They identified disk I/O as a limiting factor, highlighting the need for more SSDs and NVMe drives (see the benchmarking sketch after this list).
- Ceph Stability: They learned the importance of letting Ceph stabilize on its own during outages before intervening manually (see the maintenance sketch after this list).
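To see where disk I/O becomes the limiting factor, a baseline can be taken with Ceph's built-in benchmark before and after adding SSDs or NVMe drives. The pool name below is a placeholder; this is a sketch of the general approach, not the team's test plan.

```
# 60 seconds of object writes against an existing pool
# ("vm-pool" is a placeholder name)
rados bench -p vm-pool 60 write --no-cleanup

# Sequential and random reads against the objects just written
rados bench -p vm-pool 60 seq
rados bench -p vm-pool 60 rand

# Remove the benchmark objects afterwards
rados -p vm-pool cleanup
```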
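The lesson about letting Ceph stabilize maps onto a small set of flags and status checks. The commands below sketch the usual pattern for a planned outage: tell Ceph not to rebalance while a node is down, clear the flag afterwards, and otherwise watch recovery rather than intervening. This is standard Ceph practice, not a transcript of the team's procedure.

```
# Before taking a node down for planned maintenance, keep Ceph from
# marking its OSDs out and triggering a full rebalance
ceph osd set noout

# ...perform the maintenance, then clear the flag...
ceph osd unset noout

# During an unplanned outage, watch recovery progress instead of
# reaching for manual fixes
ceph -s
ceph health detail
```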
Testing Resilience Through Simulated Failures
To assess the robustness of their setup, the team conducted real-world tests by simulating failures, such as disconnecting power and network cables. These exercises revealed both strengths and weaknesses in their infrastructure. Early missteps, like manually adding monitors, provided valuable learning experiences. Community resources, including forums and updated documentation, were instrumental in troubleshooting and refining their systems.
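A failure drill like that is typically observed from the Ceph side while cables are pulled. The commands below sketch what that monitoring looks like, along with the integrated Proxmox path for adding monitors, the step that tripped the team up when done by hand. They are generic commands, not the exact drill from the episode.

```
# Watch the cluster react in real time while power or network is pulled
ceph -w

# Confirm which OSDs and monitors dropped out and whether quorum held
ceph osd tree
ceph mon stat
ceph quorum_status

# When adding monitors on Proxmox, the integrated tooling is the
# supported path (run on the node that should host the new monitor)
pveceph mon create
```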
Backup Strategies and Storage Configurations
For backups, Smart Communications used Proxmox Backup Server for its tight integration with Proxmox. They replicated most of their virtual machines to protect uptime, and they tailored their Ceph storage pools to balance performance against capacity by combining erasure-coded and replicated pools. They chose to run Ceph within Proxmox rather than as a standalone cluster so that visibility and management stayed in one place.
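Hooking Proxmox Backup Server into a cluster like this is a single storage definition at the datacenter level. The server address, datastore name, user, and VM ID below are placeholders; this is a sketch of the integration, not their actual configuration.

```
# Register a Proxmox Backup Server instance as a storage target
# (the certificate fingerprint comes from the PBS dashboard; credentials
#  can also be supplied via --password or the web UI)
pvesm add pbs pbs-backup \
    --server 10.0.10.50 \
    --datastore main \
    --username backup@pbs \
    --fingerprint <PBS-certificate-fingerprint>

# Back up a single VM to it (VM ID 101 is a placeholder)
vzdump 101 --storage pbs-backup --mode snapshot
```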
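The mix of replicated and erasure-coded pools can be expressed in a few commands. Pool names, the k/m values, and PG counts below are illustrative assumptions rather than the values used in the episode.

```
# Replicated pool for latency-sensitive VM disks (3 copies, min 2)
pveceph pool create vm-fast --size 3 --min_size 2 --pg_num 128

# Erasure-coded profile and pool for bulk data (4 data + 2 coding chunks,
# spread across hosts so any two nodes can fail)
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
ceph osd pool create vm-bulk 128 128 erasure ec-4-2

# RBD on an erasure-coded pool needs overwrites enabled and a small
# replicated pool to hold image metadata
ceph osd pool set vm-bulk allow_ec_overwrites true
ceph osd pool create vm-bulk-meta 32 32 replicated
ceph osd pool application enable vm-bulk rbd
ceph osd pool application enable vm-bulk-meta rbd
```

Replication keeps the performance-critical pool fast to recover, while erasure coding stretches raw capacity for everything else; that trade-off is what balancing performance against capacity amounts to in practice.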
Reflecting on the Transition
Despite the challenges, the team expressed satisfaction with their decision to move to Proxmox and Ceph. They emphasized the reliability and recoverability of their data and said they would make the same decision again, though with different hardware choices. They also underscored the importance of understanding I/O and storage strategy from the outset, and encouraged others to test, break, and learn in order to fully grasp what their infrastructure stack can do.
For more insights into Smart Communications' transition and experiences, watch the full discussion here: