POWER Vector Library Manual
1.0.4
Header package containing a collection of 128-bit SIMD operations over 16-bit integer elements. More...
#include <pveclib/vec_short_ppc.h>
Functions | |
static vui16_t | vec_absduh (vui16_t vra, vui16_t vrb) |
Vector Absolute Difference Unsigned halfword. More... | |
static vui16_t | vec_clzh (vui16_t vra) |
Vector Count Leading Zeros Halfword for unsigned short elements. More... | |
static vui16_t | vec_ctzh (vui16_t vra) |
Vector Count Trailing Zeros Halfword for unsigned short elements. More... | |
static vui16_t | vec_mrgahh (vui32_t vra, vui32_t vrb) |
Vector Merge Algebraic High Halfword operation. More... | |
static vui16_t | vec_mrgalh (vui32_t vra, vui32_t vrb) |
Vector Merge Algebraic Low Halfword operation. More... | |
static vui16_t | vec_mrgeh (vui16_t vra, vui16_t vrb) |
Vector Merge Even Halfwords operation. More... | |
static vui16_t | vec_mrgoh (vui16_t vra, vui16_t vrb) |
Vector Merge Odd Halfwords operation. More... | |
static vi16_t | vec_mulhsh (vi16_t vra, vi16_t vrb) |
Vector Multiply High Signed halfword. More... | |
static vui16_t | vec_mulhuh (vui16_t vra, vui16_t vrb) |
Vector Multiply High Unsigned halfword. More... | |
static vui16_t | vec_muluhm (vui16_t vra, vui16_t vrb) |
Vector Multiply Unsigned halfword Modulo. More... | |
static vui16_t | vec_popcnth (vui16_t vra) |
Vector Population Count halfword. More... | |
static vui16_t | vec_revbh (vui16_t vra) |
Byte reverse each halfword of a vector unsigned short. More... | |
static vb16_t | vec_setb_sh (vi16_t vra) |
Vector Set Bool from Signed Halfword. More... | |
static vui16_t | vec_slhi (vui16_t vra, const unsigned int shb) |
Vector Shift left Halfword Immediate. More... | |
static vui16_t | vec_srhi (vui16_t vra, const unsigned int shb) |
Vector Shift Right Halfword Immediate. More... | |
static vi16_t | vec_srahi (vi16_t vra, const unsigned int shb) |
Vector Shift Right Algebraic Halfword Immediate. More... | |
static vui32_t | vec_vmaddeuh (vui16_t a, vui16_t b, vui16_t c) |
Vector Multiply-Add Even Unsigned Halfwords. More... | |
static vui32_t | vec_vmaddouh (vui16_t a, vui16_t b, vui16_t c) |
Vector Multiply-Add Odd Unsigned Halfwords. More... | |
static vui16_t | vec_vmrgeh (vui16_t vra, vui16_t vrb) |
Vector Merge Even Halfwords. More... | |
static vui16_t | vec_vmrgoh (vui16_t vra, vui16_t vrb) |
Vector Merge Odd Halfwords. More... | |
Header package containing a collection of 128-bit SIMD operations over 16-bit integer elements.
Most of these operations are implemented in a single instruction on newer (POWER6/POWER7/POWER8/POWER9) processors. This header serves to fill in functional gaps for older (POWER7, POWER8) processors and provides an in-line assembler implementation for older compilers that do not provide the built-ins.
Most vector short (16-bit integer halfword) operations are implemented with PowerISA VMX instructions either defined by the original VMX (AKA Altivec) or added to later versions of the PowerISA. PowerISA 2.07B (POWER8) added several useful halfword operations (count leading zeros, population count) not included in the original VMX. PowerISA 3.0B (POWER9) adds several more (absolute difference, compare not equal, count trailing zeros, extend sign, extract/insert, and reverse bytes). Most of these intrinsic (compiler built-in) operations are defined in <altivec.h> and described in the compiler documentation.
This header covers operations that are either:
Added vec_vmaddeuh() and vec_vmaddouh() as an optimization for the vector multiply quadword implementations on POWER7.
It would be useful to provide a vector multiply high halfword (return the high order 16-bits of the 32-bit product) operation. This can be used for multiplicative inverse (effectively integer divide) operations. Neither integer multiply high nor divide is available as a vector instruction. However, the multiply high halfword operation can be composed from the existing multiply even/odd halfword operations followed by the vector merge even halfword operation. Similarly, a multiply low (modulo) halfword operation can be composed from the existing multiply even/odd halfword operations followed by the vector merge odd halfword operation.
As a prerequisite we need to provide the merge even/odd halfword operations. While the PowerISA has added these operations for word and doubleword, instructions are not defined for byte and halfword. Fortunately, vector merge operations are just a special case of vector permute. So the vec_vmrgoh() and vec_vmrgeh() implementations can use vec_perm and appropriate selection vectors to provide these merge operations.
But this is complicated by little-endian (LE) support as specified in the OpenPOWER ABI and as implemented in the compilers. Little-endian changes the effective vector element numbering and the location of even and odd elements. This means that the vector built-ins provided by altivec.h may not generate the instructions you would expect.
The OpenPOWER ABI provides a helpful table of Endian Sensitive Operations. For vec_mule (vmuleuh, vmulesh):
Replace with vmulouh and so on, for LE.
For vec_mulo (vmulouh, vmulosh):
Replace with vmuleuh and so on, for LE.
Also for vec_perm (vperm) it specifies:
For LE, Swap input arguments and complement the selection vector.
The above is just a sampling of a larger list of Endian Sensitive Operations.
The obvious coding for Vector Multiply High Halfword would be:
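The original listing is not reproduced on this page, but the arithmetic the obvious coding performs can be modeled in portable scalar C. This is a sketch with hypothetical names, using big-endian element numbering (element 0 is leftmost); it is not the vector implementation itself.

```c
#include <stdint.h>

/* Scalar model (hypothetical name) of composing multiply high halfword
   from multiply even/odd and merge even.  Each pair of 16-bit elements
   multiplies into a 32-bit product; merging the even (high) halfwords
   of those products yields the high 16 bits of each product.  */
static void
mulhuh_model (uint16_t r[8], const uint16_t a[8], const uint16_t b[8])
{
  uint32_t even[4], odd[4];
  for (int i = 0; i < 4; i++)
    {
      even[i] = (uint32_t) a[2 * i] * b[2 * i];         /* vmuleuh, BE view */
      odd[i]  = (uint32_t) a[2 * i + 1] * b[2 * i + 1]; /* vmulouh, BE view */
    }
  /* Merge even halfwords: take the high 16 bits of each 32-bit product,
     interleaving even and odd products back into element order.  */
  for (int i = 0; i < 4; i++)
    {
      r[2 * i]     = (uint16_t) (even[i] >> 16);
      r[2 * i + 1] = (uint16_t) (odd[i] >> 16);
    }
}
```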
There are a couple of problems with this:
The first step is to implement Vector Merge Even Halfword operation:
For big-endian we have a straightforward vec_perm with a permute select vector interleaving even halfwords from vectors vra and vrb.
For little-endian we need to nullify the LE transform applied by the compiler. So the select vector looks like it interleaves odd halfwords from vectors vrb and vra. It also reverses byte numbering within halfwords. The compiler transforms this back into the operation we wanted in the first place. The result is not endian sensitive and is stable across BE/LE implementations. Similarly for the Vector Merge Odd Halfword operation.
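The net effect of the two merges, in scalar terms with big-endian element numbering (element 0 leftmost), can be modeled as follows. The names are hypothetical; these are not the vector implementations.

```c
#include <stdint.h>

/* Scalar model of Vector Merge Even Halfword: interleave the even
   elements of vra and vrb, in that order.  */
static void
mrgeh_model (uint16_t r[8], const uint16_t vra[8], const uint16_t vrb[8])
{
  for (int i = 0; i < 4; i++)
    {
      r[2 * i]     = vra[2 * i];  /* even elements of vra */
      r[2 * i + 1] = vrb[2 * i];  /* even elements of vrb */
    }
}

/* Scalar model of Vector Merge Odd Halfword: interleave the odd
   elements of vra and vrb, in that order.  */
static void
mrgoh_model (uint16_t r[8], const uint16_t vra[8], const uint16_t vrb[8])
{
  for (int i = 0; i < 4; i++)
    {
      r[2 * i]     = vra[2 * i + 1]; /* odd elements of vra */
      r[2 * i + 1] = vrb[2 * i + 1]; /* odd elements of vrb */
    }
}
```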
As good OpenPOWER ABI citizens we should also provide the endian sensitive operations vec_mrgeh() and vec_mrgoh(). For example:
Also, to follow the pattern established in vec_int32_ppc.h, we should provide implementations for Vector Merge Algebraic High/Low Halfword. For example:
This is simpler as we can use the endian invariant vec_vmrgeh() operation. Similarly for Vector Merge Algebraic Low Halfword using vec_vmrgoh().
Now we have all the parts we need to implement multiply high/low halfword. For example Multiply High Unsigned Halfword:
Similarly for Multiply High Signed Halfword.
Finally we can implement the Multiply Low Halfword which by PowerISA conventions is called Multiply Unsigned Halfword Modulo:
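The low 16 bits of each product are independent of element order, so a scalar model of the modulo multiply is simply one truncating multiply per element. The name is hypothetical:

```c
#include <stdint.h>

/* Scalar model of Multiply Unsigned Halfword Modulo: the low-order
   16 bits of the 32-bit product of two unsigned halfwords.  */
static uint16_t
muluhm_model (uint16_t a, uint16_t b)
{
  return (uint16_t) ((uint32_t) a * b);
}
```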
So what does the compiler generate after unwinding three levels of inline functions? For this test case:
The GCC 8 compiler targeting powerpc64le and -mcpu=power8 generates:
The addis, addi, lvx instruction sequence loads the permute selection constant vector. The xxlnor instruction complements the selection vector for LE. These instructions are only needed once per function and can be hoisted out of loops and shared across instances of vec_mulhuh(). The result might look like this:
The vmulouh, vmuleuh, vperm instruction sequence is the core of the function. These instructions multiply the elements and then select/merge the high order 16-bits of each product into the result vector.
Suppose we have a requirement to convert an array of 16-bit unsigned short values to decimal. The classic itoa implementation performs a sequence of divide / modulo by 10 operations that produce one (decimal) digit per iteration, until the divide returns 0.
For this example we want to vectorize the operation, but the PowerISA (like most other platforms) does not provide a vector integer divide instruction. We do, however, have vector integer multiply. As we will see, the multiply high defined above is very handy for applying the multiplicative inverse. Also, the conversion divisor is a constant value applied across the vector, which simplifies the coding.
Here we can use the multiplicative inverse which is a scaled fixed point fraction calculated from the original divisor. This works nicely if the fixed radix point is just before the 16-bit fraction and we have a multiply high (vec_mulhuh()) operation. Multiplying a 16-bit unsigned integer by a 16-bit unsigned fraction generates a 32-bit product with 16-bits above (integer) and below (fraction) the radix point. The high 16-bits of the product is a good approximation of the integer quotient.
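For instance, with divisor 10 the scaled fraction is 52429/65536 ≈ 0.8, which is (1/10) × 2³, so a multiply high followed by a shift right of 3 divides by 10. A scalar model (hypothetical name) of the multiply high step:

```c
#include <stdint.h>

/* Scalar model of vec_mulhuh: the 32-bit product has 16 integer bits
   above and 16 fraction bits below the implied radix point; returning
   the high 16 bits keeps the integer part.  */
static uint16_t
mulhuh_scalar (uint16_t a, uint16_t b)
{
  return (uint16_t) (((uint32_t) a * b) >> 16);
}
```

For example, 12345 × (52429/65536) ≈ 9876, and 9876 >> 3 = 1234 = 12345 / 10.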
It turns out that generating the multiplicative inverse can be tricky. To produce correct results over the full range, requires possible pre-scaling and post-shifting, and sometimes a corrective addition. Fortunately, the mathematics are well understood and are commonly used in optimizing compilers. Even better, Henry Warren's book has a whole chapter on this topic.
In the chapter above, Figure 10-2, Computing the magic number for unsigned division, provides a sample C function for generating the magic number (actually a struct containing the magic multiplicative inverse, an "add" indicator, and the shift amount). For the 16-bit unsigned divisor 10, this is { 52429, 0, 3 }:
Which could look like this:
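The listing itself is omitted from this page. A portable scalar sketch of applying such a magic triple (struct and function names are hypothetical; 32-bit intermediates sidestep the 16-bit lane overflow that the vector code must handle with vec_avg, discussed below):

```c
#include <stdint.h>

/* Hypothetical struct modeling the output of Warren's magic-number
   generator: magic multiplier M, "add" indicator a, shift amount s.  */
struct mu
{
  uint16_t M;
  int a;
  int s;
};

/* Apply a magic triple to compute n / d using only multiply and shift.  */
static uint16_t
divu_magic (uint16_t n, struct mu m)
{
  uint32_t t = ((uint32_t) n * m.M) >> 16;  /* multiply high (vec_mulhuh) */
  if (m.a)
    {
      t = (t + n) >> 1;                     /* add correction, shift by 1 */
      return (uint16_t) (t >> (m.s - 1));   /* remaining shift of s - 1 */
    }
  return (uint16_t) (t >> m.s);
}
```

With the triple { 52429, 0, 3 } this returns n / 10 for every 16-bit n; with { 41839, 1, 14 } it returns n / 10000.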
But we also need the modulo to extract each digit. The simplest and oldest technique is to multiply the quotient by the divisor (constant 10) and subtract that from the original dividend. Here we can use the vec_muluhm() operation we defined above. Which could look like this:
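The listing is omitted here; a scalar sketch of the digit extraction follows (the name is hypothetical, and the multiply high plus both shifts are folded into one `>> 19`):

```c
#include <stdint.h>

/* Convert a 16-bit value to decimal using only multiplies: divide by 10
   via the magic inverse, multiply the quotient back (modeling
   vec_muluhm) and subtract to recover each digit.  buf needs 6 bytes.
   Returns the digit count.  */
static int
itoa16 (char buf[6], uint16_t n)
{
  char tmp[5];
  int i = 0, j = 0;
  do
    {
      uint16_t q = (uint16_t) (((uint32_t) n * 52429) >> 19); /* n / 10 */
      tmp[i++] = (char) ('0' + (n - (uint16_t) (q * 10)));    /* n % 10 */
      n = q;
    }
  while (n != 0);
  while (i > 0)               /* digits were produced least significant first */
    buf[j++] = tmp[--i];
  buf[j] = '\0';
  return j;
}
```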
As we mentioned above, some divisors require an add before the shift as a correction. For the 16-bit unsigned divisor 10000 this is { 41839, 1, 14 }:
In this case the perfect multiplier is too large (>= 2**16). So the magic multiplier is reduced by 2**16 and to correct for this we need to add the dividend to the product. This add may generate a carry that must be included in the shift. Here vec_avg handles the 17-bit sum internally before shifting right 1. But vec_avg adds an extra +1 (for rounding) that we don't want. So we use (n-1) for the product correction then complete the operation with shift right (s-1). Which could look like this:
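The listing is omitted here; a scalar sketch of the avg-based correction follows. The names are hypothetical, and a saturating subtract (modeling vec_subs) is assumed for the (n-1) term so that n = 0 does not wrap; that choice is an assumption of this sketch, not a statement about the library's exact code.

```c
#include <stdint.h>

/* Model of vec_avg: (a + b + 1) >> 1 computed without losing the
   17th (carry) bit.  */
static uint16_t
avguh (uint16_t a, uint16_t b)
{
  return (uint16_t) (((uint32_t) a + b + 1) >> 1);
}

/* Divide by 10000 with magic triple { 41839, 1, 14 }: multiply high,
   add-correct via avg with (n - 1) to cancel avg's rounding +1, then
   finish with a shift of s - 1.  */
static uint16_t
divu10000 (uint16_t n)
{
  uint16_t t = (uint16_t) (((uint32_t) n * 41839) >> 16); /* multiply high */
  uint16_t nm1 = (n == 0) ? 0 : (uint16_t) (n - 1);       /* saturating n-1 */
  return (uint16_t) (avguh (t, nm1) >> (14 - 1));         /* shift s - 1 */
}
```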
The modulo computation remains the same as in the divide by constant 10 example.
We can use the example above (see Multiply High Unsigned Halfword Example) to illustrate the performance metrics pveclib provides. For vec_mulhuh() the core operation is the sequence vmulouh/vmuleuh/vperm. This represents the best case latency, when it is used multiple times in a single larger function.
The compiler notes that vmulouh/vmuleuh are independent instructions that can execute concurrently (in separate vector pipelines), so it schedules them to issue in the same cycle. The latency for vmulouh/vmuleuh is listed as 7 cycles with a throughput of 2 per cycle (there are 2 vector pipes for multiply). As we assume this function will use both vector pipelines, the throughput for this function is reduced to 1 per cycle.
We still need to select/merge the results. The vperm instruction depends on the execution of both vmulouh/vmuleuh and on the load of the select vector. For this case we assume that the load of the permute select vector has already executed. The processor cannot issue the vperm until both vmulouh/vmuleuh instructions execute. The latency for vperm is 2 cycles (3 on POWER9). So the best case latency for this operation is (7 + 2 = 9) cycles (10 on POWER9).
Looking at the first or only execution of vec_mulhuh() in a function defines the worst-case latency. Here we have to include the permute select vector load and (for LE) the select vector complement. However, this case provides additional multi-pipe parallelism that needs to be accounted for in the latencies.
The compiler notes that addis/vmulouh/vmuleuh are independent instructions that can execute concurrently in separate pipelines, so it schedules them to issue in the same cycle. The latency for vmulouh/vmuleuh is 7 cycles while the addis latency is only 2 cycles. The dependent addi instruction can issue in the 3rd cycle, while vmulouh/vmuleuh are still executing. The addi also has a 2 cycle latency, so the dependent lvx can issue in the 5th cycle, while vmulouh/vmuleuh are still executing. The lvx has a latency of 5 cycles and will not complete execution until 2 cycles after vmulouh/vmuleuh. The dependent xxlnor is waiting on the load (lvx) and has a latency of 2 cycles.
So there are two independent instruction sequences: vmulouh/vmuleuh and addis/addi/lvx/xxlnor. Both must complete execution before the vperm can issue and complete the operation. The latter sequence has the longer (2+2+5+2 = 11) cycle latency and dominates the timing. So the worst-case latency for the full sequence is (2+2+5+2+2 = 13) cycles (14 on POWER9).
processor | Latency | Throughput |
---|---|---|
power8 | 9-13 | 1/cycle |
power9 | 10-14 | 1/cycle |
High level performance estimates are provided as an aid to function selection when evaluating algorithms. For background on how Latency and Throughput are derived see: Performance data.
Vector Absolute Difference Unsigned halfword.
Compute the absolute difference for each halfword. For each unsigned halfword, subtract VRB[i] from VRA[i] and return the absolute value of the difference.
processor | Latency | Throughput |
---|---|---|
power8 | 4 | 1/cycle |
power9 | 3 | 2/cycle |
vra | vector of 8 x unsigned halfword |
vrb | vector of 8 x unsigned halfword |
Vector Count Leading Zeros Halfword for unsigned short elements.
Count the number of leading '0' bits (0-16) within each halfword element of a 128-bit vector.
For POWER8 (PowerISA 2.07B) or later use the Vector Count Leading Zeros Halfword instruction vclzh. Otherwise use sequence of pre 2.07 VMX instructions.
processor | Latency | Throughput |
---|---|---|
power8 | 2 | 2/cycle |
power9 | 3 | 2/cycle |
vra | 128-bit vector treated as 8 x 16-bit unsigned integer (halfword) elements. |
Vector Count Trailing Zeros Halfword for unsigned short elements.
Count the number of trailing '0' bits (0-16) within each halfword element of a 128-bit vector.
For POWER9 (PowerISA 3.0B) or later use the Vector Count Trailing Zeros Halfword instruction vctzh. Otherwise use a sequence of pre ISA 3.0 VMX instructions. SIMDized count trailing zeros inspired by:
Warren, Henry S. Jr., Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5, Counting Bits, Section 5-4.
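The cited technique computes trailing zeros as a population count of the mask ~x & (x - 1), which has a '1' in exactly the trailing-zero positions of x. A scalar sketch (hypothetical name):

```c
#include <stdint.h>

/* Scalar model of the SIMDizable trailing-zero count for a 16-bit
   element: popcount (~x & (x - 1)); x == 0 yields 16.  */
static unsigned int
ctz16_model (uint16_t x)
{
  uint16_t mask = (uint16_t) (~x & (x - 1));
  unsigned int n = 0;
  while (mask)          /* population count of the mask */
    {
      n += mask & 1;
      mask >>= 1;
    }
  return n;
}
```

This formulation needs no per-element branch, which is what makes it suitable for SIMD lanes.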
processor | Latency | Throughput |
---|---|---|
power8 | 6-8 | 2/cycle |
power9 | 3 | 2/cycle |
vra | 128-bit vector treated as 8 x 16-bit unsigned short integer (halfword) elements. |
Vector Merge Algebraic High Halfword operation.
Merge only the high halfwords from 8 x Algebraic words across vectors vra and vrb. This is effectively the Vector Merge Even Halfword operation that is not modified for endian.
For example merge the high 16-bits from each of 8 x 32-bit products as generated by vec_muleuh/vec_mulouh. This result is effectively a vector multiply high unsigned halfword.
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned int. |
vrb | 128-bit vector unsigned int. |
Vector Merge Algebraic Low Halfword operation.
Merge only the low halfwords from 8 x Algebraic words across vectors vra and vrb. This is effectively the Vector Merge Odd Halfword operation that is not modified for endian.
For example merge the low 16-bits from each of 8 x 32-bit products as generated by vec_muleuh/vec_mulouh. This result is effectively a vector multiply low unsigned halfword.
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned int. |
vrb | 128-bit vector unsigned int. |
Vector Merge Even Halfwords operation.
Merge the even halfword elements from the concatenation of 2 x vectors (vra and vrb).
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned short. |
vrb | 128-bit vector unsigned short. |
Vector Merge Odd Halfwords operation.
Merge the odd halfword elements from the concatenation of 2 x vectors (vra and vrb).
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned short. |
vrb | 128-bit vector unsigned short. |
Vector Multiply High Signed halfword.
Multiply the corresponding halfword elements of two vector signed short values and return the high order 16-bits of each 32-bit product element.
processor | Latency | Throughput |
---|---|---|
power8 | 9-13 | 1/cycle |
power9 | 10-14 | 1/cycle |
vra | 128-bit vector signed short. |
vrb | 128-bit vector signed short. |
Vector Multiply High Unsigned halfword.
Multiply the corresponding halfword elements of two vector unsigned short values and return the high order 16-bits of each 32-bit product element.
processor | Latency | Throughput |
---|---|---|
power8 | 9-13 | 1/cycle |
power9 | 10-14 | 1/cycle |
vra | 128-bit vector unsigned short. |
vrb | 128-bit vector unsigned short. |
Vector Multiply Unsigned halfword Modulo.
Multiply the corresponding halfword elements of two vector unsigned short values and return the low order 16-bits of the 32-bit product for each element.
processor | Latency | Throughput |
---|---|---|
power8 | 9-13 | 1/cycle |
power9 | 10-14 | 1/cycle |
vra | 128-bit vector unsigned short. |
vrb | 128-bit vector unsigned short. |
Vector Population Count halfword.
Count the number of '1' bits (0-16) within each halfword element of a 128-bit vector.
For POWER8 (PowerISA 2.07B) or later use the Vector Population Count Halfword instruction. Otherwise use simple Vector (VMX) instructions to count bits in bytes in parallel.
processor | Latency | Throughput |
---|---|---|
power8 | 2 | 2/cycle |
power9 | 3 | 2/cycle |
vra | 128-bit vector treated as 8 x 16-bit integers (halfword) elements. |
Byte reverse each halfword of a vector unsigned short.
For each halfword of the input vector, reverse the order of bytes / octets within the halfword.
processor | Latency | Throughput |
---|---|---|
power8 | 2-11 | 2/cycle |
power9 | 3 | 2/cycle |
vra | a 128-bit vector unsigned short. |
Vector Set Bool from Signed Halfword.
For each halfword, propagate the sign bit to all 16-bits of that halfword. The result is vector bool short reflecting the sign bit of each 16-bit halfword.
processor | Latency | Throughput |
---|---|---|
power8 | 2-4 | 2/cycle |
power9 | 2-5 | 2/cycle |
vra | Vector signed short. |
Vector Shift left Halfword Immediate.
Shift left each halfword element [0-7], 0-15 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-15. A shift count of 0 returns the original value of vra. Shift counts greater than 15 bits return zero.
processor | Latency | Throughput |
---|---|---|
power8 | 4-11 | 2/cycle |
power9 | 5-11 | 2/cycle |
vra | a 128-bit vector treated as a vector unsigned short. |
shb | Shift amount in the range 0-15. |
Vector Shift Right Algebraic Halfword Immediate.
Shift right algebraic each halfword element [0-7], 0-15 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-15. A shift count of 0 returns the original value of vra. Shift counts greater than 15 bits return the sign bit propagated to each bit of each element.
processor | Latency | Throughput |
---|---|---|
power8 | 4-11 | 2/cycle |
power9 | 5-11 | 2/cycle |
vra | a 128-bit vector treated as a vector signed short. |
shb | Shift amount in the range 0-15. |
Vector Shift Right Halfword Immediate.
Shift right each halfword element [0-7], 0-15 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-15. A shift count of 0 returns the original value of vra. Shift counts greater than 15 bits return zero.
processor | Latency | Throughput |
---|---|---|
power8 | 4-11 | 2/cycle |
power9 | 5-11 | 2/cycle |
vra | a 128-bit vector treated as a vector unsigned short. |
shb | Shift amount in the range 0-15. |
Vector Multiply-Add Even Unsigned Halfwords.
Multiply the even 16-bit halfwords of vector unsigned short values (a * b) and return the sums of each unsigned 32-bit product and the zero-extended even 16-bit halfword of c: (aeven * beven) + EXTZ(ceven).
processor | Latency | Throughput |
---|---|---|
power8 | 9-18 | 2/cycle |
power9 | 9-16 | 2/cycle |
a | 128-bit vector unsigned short. |
b | 128-bit vector unsigned short. |
c | 128-bit vector unsigned short. |
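A scalar sketch of the even multiply-add arithmetic (hypothetical name, big-endian element numbering, element 0 leftmost):

```c
#include <stdint.h>

/* Scalar model of Multiply-Add Even Unsigned Halfword: for each even
   halfword position, form the 32-bit product a[2i] * b[2i] and add the
   zero-extended halfword c[2i].  The result cannot overflow 32 bits:
   (2^16 - 1)^2 + (2^16 - 1) = 2^32 - 2^16.  */
static void
vmaddeuh_model (uint32_t r[4], const uint16_t a[8],
                const uint16_t b[8], const uint16_t c[8])
{
  for (int i = 0; i < 4; i++)
    r[i] = (uint32_t) a[2 * i] * b[2 * i] + c[2 * i];
}
```

The no-overflow property is what makes this multiply-add usable as a building block for the wider (quadword) multiply implementations mentioned above.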
Vector Multiply-Add Odd Unsigned Halfwords.
Multiply the odd 16-bit halfwords of vector unsigned short values (a * b) and return the sums of each unsigned 32-bit product and the zero-extended odd 16-bit halfword of c: (aodd * bodd) + EXTZ(codd).
processor | Latency | Throughput |
---|---|---|
power8 | 9-18 | 2/cycle |
power9 | 9-16 | 2/cycle |
a | 128-bit vector unsigned short. |
b | 128-bit vector unsigned short. |
c | 128-bit vector unsigned short. |
Vector Merge Even Halfwords.
Merge the even halfword elements from the concatenation of 2 x vectors (vra and vrb).
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned short. |
vrb | 128-bit vector unsigned short. |
Vector Merge Odd Halfwords.
Merge the odd halfword elements from the concatenation of 2 x vectors (vra and vrb).
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned short. |
vrb | 128-bit vector unsigned short. |