


Collective communication algorithms employ many processors working in concert to aggregate data. NCCL is not a full-blown parallel programming framework; rather, it is a library focused on accelerating collective communication primitives.

Tight synchronization between communicating processors is a key aspect of collective communication. CUDA® based collectives would traditionally be realized through a combination of CUDA memory copy operations and CUDA kernels for local reduction. NCCL, on the other hand, implements each collective in a single kernel handling both communication and computation operations. This allows for fast synchronization and minimizes the resources needed to reach peak bandwidth.

NCCL conveniently removes the need for developers to optimize their applications for specific machines. It provides fast collectives over multiple GPUs both within and across nodes, supports a variety of interconnect technologies, and automatically patterns its communication strategy to match the system's underlying interconnect topology.

Next to performance, ease of programming was the primary consideration in the design of NCCL. NCCL uses a simple C API, which can be easily accessed from a variety of programming languages.
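
As an illustration of how the C API is typically used, the following is a minimal sketch of a single-process all-reduce across all visible GPUs. The buffer size, variable names, and omitted error checking are illustrative choices, not requirements of NCCL; a real application should check every ncclResult_t and cudaError_t return value.

    #include <nccl.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);

        ncclComm_t  *comms   = malloc(ndev * sizeof(ncclComm_t));
        cudaStream_t *streams = malloc(ndev * sizeof(cudaStream_t));
        float **sendbuf = malloc(ndev * sizeof(float *));
        float **recvbuf = malloc(ndev * sizeof(float *));
        size_t count = 1 << 20;  /* elements per GPU (arbitrary for this sketch) */

        /* Allocate one buffer pair and one stream per device.
           A real application would fill the send buffers with data here. */
        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
            cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
            cudaStreamCreate(&streams[i]);
        }

        /* Create one communicator per device in a single call. */
        ncclCommInitAll(comms, ndev, NULL);

        /* Sum-reduce the send buffers across all GPUs; each collective is
           issued asynchronously on its stream. */
        ncclGroupStart();
        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                          comms[i], streams[i]);
        }
        ncclGroupEnd();

        /* Wait for completion, then release resources. */
        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaStreamSynchronize(streams[i]);
            cudaFree(sendbuf[i]);
            cudaFree(recvbuf[i]);
            cudaStreamDestroy(streams[i]);
            ncclCommDestroy(comms[i]);
        }
        free(comms); free(streams); free(sendbuf); free(recvbuf);
        return 0;
    }

The same ncclAllReduce call works unchanged whether the communicators span one node or several; NCCL selects the transport and communication pattern for the detected topology.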
