Payload should be replaced by something like:
struct MPI_Message {
    sham::DeviceBuffer<u8> & bytebuf; // maybe use a host/device variant
    u64 offset, length;               // byte range of this message inside bytebuf
    i32 rank_sender, rank_receiver;
};
where bytebuf is shared by multiple messages. Additionally, the direct GPU toggle should be handled manually: if the GPU is not available, the allocation is swapped for a host one at the last minute.
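
A minimal sketch of the idea, under stated assumptions: `ByteBuffer` is a hypothetical stand-in for `sham::DeviceBuffer<u8>` (a plain `std::vector` with a pretend `on_device` flag), and `MessageView`, `pack_messages`, `stage_on_host_if_needed` and `direct_gpu_available` are illustrative names, not existing API. The view holds a pointer rather than a reference so that it can be repointed to a host buffer at the last minute; that is one possible design, not necessarily the intended one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u8  = std::uint8_t;
using u64 = std::uint64_t;
using i32 = std::int32_t;

// Hypothetical stand-in for sham::DeviceBuffer<u8>.
struct ByteBuffer {
    std::vector<u8> data;
    bool on_device; // pretend flag: true if the storage lives on the GPU
};

// Non-owning description of one message inside a shared byte buffer.
struct MessageView {
    ByteBuffer *bytebuf; // pointer (not reference) so it can be swapped later
    u64 offset, length;
    i32 rank_sender, rank_receiver;
};

// Append every payload to `buf` back to back and return one view per payload.
inline std::vector<MessageView> pack_messages(
    ByteBuffer &buf,
    const std::vector<std::vector<u8>> &payloads,
    i32 sender,
    const std::vector<i32> &receivers) {
    assert(payloads.size() == receivers.size());
    std::vector<MessageView> msgs;
    for (std::size_t i = 0; i < payloads.size(); i++) {
        u64 offset = static_cast<u64>(buf.data.size());
        u64 length = static_cast<u64>(payloads[i].size());
        buf.data.insert(buf.data.end(), payloads[i].begin(), payloads[i].end());
        msgs.push_back(MessageView{&buf, offset, length, sender, receivers[i]});
    }
    return msgs;
}

// Last-minute swap: if direct GPU communication is unavailable, copy the
// shared buffer once into `host_staging` and repoint every view to it.
// Assumes all views in `msgs` share the same buffer.
inline void stage_on_host_if_needed(
    std::vector<MessageView> &msgs,
    ByteBuffer &host_staging,
    bool direct_gpu_available) {
    if (direct_gpu_available || msgs.empty()) return;
    ByteBuffer *src = msgs.front().bytebuf;
    if (!src->on_device) return;   // already on the host, nothing to do
    host_staging.data = src->data; // stands in for a device-to-host copy
    host_staging.on_device = false;
    for (MessageView &m : msgs) m.bytebuf = &host_staging;
}
```

Because offsets are relative to the start of the buffer, they stay valid after the swap: only the buffer pointer changes, and the copy happens once per shared buffer rather than once per message.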