
AvidThink

Research and Analysis

New Research Brief: Next-Gen Data Center Networking – Built for AI, Powered by AI

By Mike Metz of AvidThink | February 21, 2019

AvidThink is pleased to announce the availability of our latest research brief, covering the impact of new workloads, including Artificial Intelligence (AI) and Machine Learning (ML), on data center networking. If you’re ready to get your hands on a copy of “Next-Gen Data Center Networking – Built for AI, Powered by AI,” you can download your personal copy from our sponsor, Huawei, here. In the meantime, we’ll share some highlights, though we’d still recommend downloading and reading through the nine pages of content. It’s a quick read, and it also includes AvidThink’s review of Huawei’s AI Fabric solution family, which features the recently launched CloudEngine 16800 switch supporting a massive 768 ports of 400GbE.

Rise of AI/ML and Storage Workloads

Next-generation data center networks have to evolve to accommodate new consumer and business application workloads. Video and media-rich content drive about three-quarters of Internet traffic, and while these large-bandwidth streams will continue to dominate, a massive number of lower-bandwidth streams will carry critical transactional API calls from mobile and desktop applications and IoT devices.

At the same time, within the data center, the move to microservices-based software architectures, distributed storage, and Artificial Intelligence/Machine Learning (AI/ML) workloads is pushing east-west traffic to previously unseen heights. This traffic will not abate anytime soon: IDC expects worldwide revenue for big data and business analytics solutions to reach $260 billion in 2022, growing at 11.9 percent annually. Moreover, two of the fastest-growing categories in this segment are cognitive/AI software (36.5% CAGR) and non-relational analytic data stores (30.3% CAGR).
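As a quick sanity check on how steeply those figures compound, the arithmetic behind a CAGR projection is straightforward (the 2018 start year below is our own assumption for illustration, not a figure from IDC):

```python
def project(base, cagr, years):
    """Compound a base figure forward at a constant annual growth rate."""
    return base * (1 + cagr) ** years

# IDC's figure: $260B in 2022 at 11.9% annual growth. Working backward
# four years gives the implied base (illustrative arithmetic only).
base_2018 = 260 / (1 + 0.119) ** 4
print(f"Implied 2018 revenue: ${base_2018:.0f}B")  # roughly $166B

# The faster-growing sub-categories compound far more steeply:
growth_5y = project(1, 0.365, 5)
print(f"Cognitive/AI software, 5 years at 36.5% CAGR: x{growth_5y:.1f}")
```

Even a modest-sounding difference in CAGR, compounded over a few years, is what turns cognitive/AI software into one of the dominant drivers of east-west traffic.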

Meeting Data Center Networking Needs

To meet these needs, data center networks must reach new levels of performance, scale, and visibility. Unsurprisingly, one way to support these new performance levels is simply to increase the speed of the interconnect fabric. Today, 25GbE, 40GbE, and 100GbE switches make up the majority of data center connectivity, but larger-scale 400GbE switches from major manufacturers are starting to hit the market. At the same time, server NICs are getting smarter, with built-in co-processing in the form of FPGAs, NPUs, or ASICs that can offload I/O traffic from the CPU and drive faster connectivity speeds.

Further, portions of these fabrics now run in lossless mode, and the RDMA over Converged Ethernet standards (RoCE and RoCEv2) are favored by many data center operators. Compared to more specialized fabrics like InfiniBand, converged Ethernet promises a standardized approach across the data center for switches, cabling, and operations and maintenance infrastructure. Using RoCE and RoCEv2 can improve the performance of storage and AI/ML workloads, in some cases dramatically reducing the training time needed for AI/ML models. The AI/ML community has taken note: many popular frameworks already support RDMA.

To ensure these fabrics can run lossless without hurting overall throughput, new technologies for managing queuing and buffering are on the horizon, with some vendors turning to optimization and AI/ML to tune switch resources, maximizing throughput while keeping latencies in check. This is a case where AI/ML is used to facilitate AI/ML workloads.
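To give a flavor of what "tuning switch resources" can mean in practice, the toy sketch below adjusts an ECN marking threshold with a simple proportional feedback loop. This is purely illustrative and is not any vendor's actual algorithm; real systems feed much richer telemetry into learned models, and all the parameter values here are assumptions:

```python
def tune_ecn_threshold(threshold, queue_depth, target_depth,
                       step=0.1, min_thr=10.0, max_thr=1000.0):
    """One step of a toy proportional controller for an ECN marking threshold.

    If queues run deeper than the target, lower the threshold so congestion
    is signaled earlier; if queues are shallow, raise it to avoid marking
    packets unnecessarily and sacrificing throughput.
    """
    error = (queue_depth - target_depth) / target_depth
    threshold *= (1 - step * error)
    return max(min_thr, min(max_thr, threshold))

# Simulated telemetry: queue depths (KB) sampled from a congested port.
threshold = 200.0
for depth in [180, 260, 310, 290, 240, 210]:
    threshold = tune_ecn_threshold(threshold, depth, target_depth=200)
print(f"Adapted threshold: {threshold:.0f} KB")
```

The point of the sketch is the feedback structure: an ML-assisted switch does the same thing continuously, across many queues at once, using traffic patterns rather than a single fixed target to decide where the thresholds should sit.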

Certainly, that’s not the only use of AI/ML, and the research brief goes into other important use cases in the data center—for more details, download your copy today.

Huawei’s AI Fabric Solution Review

As a value-add portion of our AI in Next-Gen Data Center research brief, we’ve also reviewed Huawei’s AI Fabric solution and provided AvidThink’s take on its capabilities. Huawei’s AI Fabric solutions include a range of data center switches from the CloudEngine family, the most recent being the CloudEngine 16800, which had just been announced when we completed the solution review.

In the face of continued political challenges in the Western world, Huawei has kept innovating over the last year, shipping new products and pushing the technology envelope. Its AI Fabric solution has been validated by EANTC, which demonstrated performance improvements for HPC workloads (which resemble AI/ML workloads) and for distributed storage: a 40% latency reduction in HPC tests and a 25% improvement in IOPS for storage.

Key to Huawei’s innovations in these switches is its investment in a converged Ethernet approach with RoCEv2/RDMA. By combining a high-speed fabric with more intelligent switch management, Huawei is betting that Ethernet will be the foundation of future data centers powering AI/ML applications. Huawei has also demonstrated innovation in its iLossless family of algorithms, which include techniques such as virtual input queues and dynamic congestion waterlines, among others we cover in the solution review. Combined, these improvements give AI Fabric its ability to accelerate AI/ML and distributed storage workloads.

Huawei has also started using its own AI/ML technologies to manage and tune these switches to improve both local switch and global switch cluster performance, increasing throughput and lowering latency end-to-end in large data centers. The solution review provides more insight into how Huawei’s innovations in AI/ML apply to optimization and troubleshooting.

Huawei’s approach to using AI/ML to accelerate data center networks for AI/ML and other workloads will be a significant benefit to network engineers everywhere, and their AI Fabric solutions coupled with their recently announced CloudEngine 16800 will make a big splash when rolled out in the field next quarter. To prepare for that, get a leg up on your peers by grabbing your copy of the combined research brief and solution review today.

About the Author

Mike Metz is an associate analyst at AvidThink and contributes to its industry reports covering 5G, IoT, SDN, NFV, SD-WAN, Security, and Cloud Computing. He is also responsible for research operations and oversees the end-to-end lifecycle of research products and content creation. Mike was formerly the research program manager at SDxCentral, where he helped drive the publication of more than 50 industry reports in key technology fields.


© 2023 AvidThink LLC. All Rights Reserved.
