/**
 * @title Using SIMD Instructions to Sum a Vector with RcppParallel
 * @author Kevin Ushey
 * @license GPL (>= 2)
 * @tags parallel
 * @summary Demonstrates how RcppParallel could be used to implement a SIMD-aware sum.
 */

/**
 * Newer releases of [RcppParallel](http://rcppcore.github.io/RcppParallel/) now
 * bundle the [Boost.SIMD](https://www.lri.fr/~falcou/pub/pact-2012.pdf)
 * library, providing a framework for ensuring that the algorithms you write
 * take advantage of the vectorized instructions offered by new CPUs. This
 * article illustrates how you might use `Boost.SIMD` to sum all of the elements
 * in a vector.
 *
 * First, let's review how we might use `std::accumulate()` to sum a vector of numbers.
 * We explicitly pass in the `std::plus<double>()` functor, just to make it clear that
 * the `std::accumulate()` algorithm expects a binary functor when accumulating values.
 */

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double vectorSum(NumericVector x) {
  return std::accumulate(x.begin(), x.end(), 0.0, std::plus<double>());
}

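/**
 * As a quick sanity check, this should give the same result as R's
 * built-in `sum()`:
 */

/*** R
# compare against base R's sum()
x <- runif(10)
all.equal(vectorSum(x), sum(x))
*/
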
/**
 * Now, let's rewrite this to take advantage of `Boost.SIMD`. At a high
 * level, there are two main steps:
 *
 * 1. Write a functor with a templated call operator, whose implementation
 *    is written in a '`Boost.SIMD`-aware' way;
 *
 * 2. Provide the functor as an argument to `boost::simd::transform()` or
 *    `boost::simd::accumulate()`, as appropriate.
 *
 * Let's follow these steps in implementing our SIMD sum.
 */

// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>

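// A 'Boost.SIMD'-aware binary functor: because the call operator is templated,
// it can be invoked both on plain doubles and on SIMD packs of doubles.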
struct simd_plus {
  template <typename T>
  T operator()(const T& lhs, const T& rhs) {
    return lhs + rhs;
  }
};

// [[Rcpp::export]]
double vectorSumSimd(NumericVector x) {
  return boost::simd::accumulate(x.begin(), x.end(), 0.0, simd_plus());
}

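/**
 * Step 2 above also mentions `boost::simd::transform()`. As a minimal,
 * untested sketch -- assuming its unary form mirrors the signature of
 * `std::transform()`, and using a hypothetical `simd_square` functor written
 * in the same style as `simd_plus` -- an element-wise operation might look
 * like this:
 */

// hypothetical functor: squares each element (scalar or SIMD pack)
struct simd_square {
  template <typename T>
  T operator()(const T& x) const {
    return x * x;
  }
};

// [[Rcpp::export]]
NumericVector vectorSquareSimd(NumericVector x) {
  NumericVector out(x.size());
  boost::simd::transform(x.begin(), x.end(), out.begin(), simd_square());
  return out;
}
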
/**
 * As you can see, it's quite simple to take advantage of `Boost.SIMD`.
 * Now, let's compare the performance of the two sum implementations.
 */

/*** R
library(microbenchmark)

# allocate a large vector
set.seed(123)
v <- rnorm(1E6)

# ensure they compute the same values
stopifnot(all.equal(vectorSum(v), vectorSumSimd(v)))

# compare performance
res <- microbenchmark(vectorSum(v), vectorSumSimd(v))
print(res)
*/

/**
 * Perhaps surprisingly, the `Boost.SIMD` solution is much faster -- the gains
 * are similar to what we might have seen when computing the sum in parallel.
 * However, we're still using just a single core; we're simply taking advantage
 * of the vectorized instructions provided by the CPU. In this particular case,
 * on Intel CPUs, `Boost.SIMD` will ensure that we are using the `addpd`
 * instruction, which is documented in the Intel Software Developer's Manual
 * [[PDF](http://www.intel.com/Assets/en_US/PDF/manual/253666.pdf)].
 *
 * Note that, for the naive serial sum, the compiler would likely generate
 * similarly efficient code when the `-ffast-math` optimization flag is set.
 * By default, the compiler is somewhat 'pessimistic' about the set of
 * optimizations it can perform around floating point arithmetic. This is
 * because it must respect the [IEEE floating
 * point](https://en.wikipedia.org/wiki/IEEE_floating_point) standard, and
 * this means respecting the fact that, for example, floating point
 * operations are not associative:
 */

/*** R
((0.1 + 0.2) + 0.3) - (0.1 + (0.2 + 0.3))
*/

/**
 * Surprisingly, the above computation does not evaluate to zero!
 *
 * In practice, you're likely safe to take advantage of the `-ffast-math`
 * optimizations, or `Boost.SIMD`, in your own work. However, be sure to test and
 * verify!
 */
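
/**
 * For reference, one way to experiment with `-ffast-math` (a GCC/Clang flag)
 * when compiling a standalone file with `Rcpp::sourceCpp()` is to set the
 * `PKG_CXXFLAGS` environment variable before compiling. Treat the snippet
 * below as a sketch to adapt -- the file name is a placeholder, and the
 * available flags depend on your compiler.
 */

/*** R
# sketch: enable -ffast-math for a subsequent sourceCpp() call
Sys.setenv(PKG_CXXFLAGS = "-ffast-math")
# Rcpp::sourceCpp("your_file.cpp")  # compile your own file with the flag set
Sys.unsetenv("PKG_CXXFLAGS")
*/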