POWER Vector Library Manual
1.0.4
Header package containing a collection of 128-bit SIMD operations over 8-bit integer (char) elements. More...
#include <pveclib/vec_common_ppc.h>
Functions | |
static vui8_t | vec_absdub (vui8_t vra, vui8_t vrb) |
Vector Absolute Difference Unsigned byte. More... | |
static vui8_t | vec_clzb (vui8_t vra) |
Vector Count Leading Zeros Byte for unsigned char (byte) elements. More... | |
static vui8_t | vec_ctzb (vui8_t vra) |
Vector Count Trailing Zeros Byte for unsigned char (byte) elements. More... | |
static vui8_t | vec_isalnum (vui8_t vec_str) |
Vector isalnum. More... | |
static vui8_t | vec_isalpha (vui8_t vec_str) |
Vector isalpha. More... | |
static vui8_t | vec_isdigit (vui8_t vec_str) |
Vector isdigit. More... | |
static vui8_t | vec_mrgahb (vui16_t vra, vui16_t vrb) |
Vector Merge Algebraic High Byte operation. More... | |
static vui8_t | vec_mrgalb (vui16_t vra, vui16_t vrb) |
Vector Merge Algebraic Low Byte operation. More... | |
static vui8_t | vec_mrgeb (vui8_t vra, vui8_t vrb) |
Vector Merge Even Bytes operation. More... | |
static vui8_t | vec_mrgob (vui8_t vra, vui8_t vrb) |
Vector Merge Odd Bytes operation. More... | |
static vi8_t | vec_mulhsb (vi8_t vra, vi8_t vrb) |
Vector Multiply High Signed Bytes. More... | |
static vui8_t | vec_mulhub (vui8_t vra, vui8_t vrb) |
Vector Multiply High Unsigned Bytes. More... | |
static vui8_t | vec_mulubm (vui8_t vra, vui8_t vrb) |
Vector Multiply Unsigned Byte Modulo. More... | |
static vui8_t | vec_popcntb (vui8_t vra) |
Vector Population Count byte. More... | |
static vb8_t | vec_setb_sb (vi8_t vra) |
Vector Set Bool from Signed Byte. More... | |
static vui8_t | vec_slbi (vui8_t vra, const unsigned int shb) |
Vector Shift left Byte Immediate. More... | |
static vi8_t | vec_srabi (vi8_t vra, const unsigned int shb) |
Vector Shift Right Algebraic Byte Immediate. More... | |
static vui8_t | vec_srbi (vui8_t vra, const unsigned int shb) |
Vector Shift Right Byte Immediate. More... | |
static vui8_t | vec_shift_leftdo (vui8_t vrw, vui8_t vrx, vui8_t vrb) |
Shift left double quadword by octet. Return a vector unsigned char that is the left most 16 chars after shifting left 0-15 octets (chars) of the 32 char double vector (vrw||vrx). The octet shift amount is from bits 121:124 of vrb. More... | |
static vui8_t | vec_toupper (vui8_t vec_str) |
Vector toupper. More... | |
static vui8_t | vec_tolower (vui8_t vec_str) |
Vector tolower. More... | |
static vui8_t | vec_vmrgeb (vui8_t vra, vui8_t vrb) |
Vector Merge Even Bytes. More... | |
static vui8_t | vec_vmrgob (vui8_t vra, vui8_t vrb) |
Vector Merge Odd Byte. More... | |
Header package containing a collection of 128-bit SIMD operations over 8-bit integer (char) elements.
Most of these operations are implemented in a single VMX or VSX instruction on newer (POWER6/POWER7/POWER8/POWER9) processors. This header serves to fill in functional gaps for older (POWER7, POWER8) processors and provides inline assembler implementations for older compilers that do not provide the built-ins.
Most vector char (8-bit integer) operations are already covered by the original VMX (AKA Altivec) instructions. VMX intrinsic (compiler built-in) operations are defined in <altivec.h> and described in the compiler documentation. PowerISA 2.07B (POWER8) added several useful byte operations (count leading zeros, population count) not included in the original VMX. PowerISA 3.0B (POWER9) adds several more (absolute difference, compare not equal, count trailing zeros, extend sign, extract/insert, and reverse bytes). Most of these intrinsic (compiler built-in) operations are defined in <altivec.h> and described in the compiler documentation.
This header covers operations that are either:
It would be useful to provide a vector multiply high byte operation (returning the high order 8-bits of the 16-bit product). This can be used for multiplicative inverse (effectively integer divide) operations. Neither integer multiply high nor divide is available as a vector instruction. However, the multiply high byte operation can be composed from the existing multiply even/odd byte operations followed by the vector merge even byte operation. Similarly, a multiply low (modulo) byte operation can be composed from the existing multiply even/odd byte operations followed by the vector merge odd byte operation.
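The composition described above can be sketched with a portable scalar model. This is only an illustration under stated assumptions, not the pveclib implementation: the `ref_` names are hypothetical, vectors are modeled as plain arrays, and element 0 is taken to be the leftmost (big-endian numbering).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical models of the multiply even/odd byte operations:
   each yields 8 x 16-bit products. */
static void ref_muleub(uint16_t p[8], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint16_t)((unsigned)a[2 * i] * b[2 * i]);
}

static void ref_muloub(uint16_t p[8], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint16_t)((unsigned)a[2 * i + 1] * b[2 * i + 1]);
}

/* Interleave the high (shift = 8) or low (shift = 0) byte of each product. */
static void ref_merge_byte(uint8_t r[16], const uint16_t even[8],
                           const uint16_t odd[8], int shift)
{
    assert(shift == 0 || shift == 8);
    for (int i = 0; i < 8; i++) {
        r[2 * i]     = (uint8_t)(even[i] >> shift);
        r[2 * i + 1] = (uint8_t)(odd[i] >> shift);
    }
}

/* Multiply high: the high bytes of the products, back in element order. */
static void ref_mulhub(uint8_t r[16], const uint8_t a[16], const uint8_t b[16])
{
    uint16_t e[8], o[8];
    ref_muleub(e, a, b);
    ref_muloub(o, a, b);
    ref_merge_byte(r, e, o, 8);
}

/* Multiply modulo: the low bytes of the products, back in element order. */
static void ref_mulubm(uint8_t r[16], const uint8_t a[16], const uint8_t b[16])
{
    uint16_t e[8], o[8];
    ref_muleub(e, a, b);
    ref_muloub(o, a, b);
    ref_merge_byte(r, e, o, 0);
}
```

Interleaving the even and odd products restores the original element order, which is why a merge step is needed after the two multiplies.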
As a prerequisite we need to provide the merge even/odd byte operations. While PowerISA has added these operations for word and doubleword, instructions are not defined for byte and halfword. Fortunately vector merge operations are just a special case of vector permute. So the vec_vmrgob() and vec_vmrgeb() implementation can use vec_perm and appropriate selection vectors to provide these merge operations.
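To illustrate why merge is a special case of permute, the following scalar sketch models a byte permute and derives merge even/odd from fixed selection vectors. The `ref_` names and the selection constants are assumptions for illustration, using big-endian element numbering; the selection vectors in the actual header may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a byte permute: each selector picks one byte
   from the 32-byte concatenation a||b. */
static void ref_perm(uint8_t out[16], const uint8_t a[16],
                     const uint8_t b[16], const uint8_t sel[16])
{
    assert(out != a && out != b); /* output must not alias the inputs */
    for (int i = 0; i < 16; i++) {
        unsigned s = sel[i] & 0x1F;
        out[i] = (s < 16) ? a[s] : b[s - 16];
    }
}

/* Merge even bytes: a0, b0, a2, b2, ..., a14, b14. */
static void ref_mrgeb(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    static const uint8_t even_sel[16] = {
        0, 16, 2, 18, 4, 20, 6, 22, 8, 24, 10, 26, 12, 28, 14, 30
    };
    ref_perm(out, a, b, even_sel);
}

/* Merge odd bytes: a1, b1, a3, b3, ..., a15, b15. */
static void ref_mrgob(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    static const uint8_t odd_sel[16] = {
        1, 17, 3, 19, 5, 21, 7, 23, 9, 25, 11, 27, 13, 29, 15, 31
    };
    ref_perm(out, a, b, odd_sel);
}
```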
As described for other element sizes this is complicated by little-endian (LE) support as specified in the OpenPOWER ABI and as implemented in the compilers. Little-endian changes the effective vector element numbering and the location of even and odd elements. This means that the vector built-ins provided by altivec.h may not generate the instructions you would expect.
So this header defines endian independent byte operations vec_vmrgeb() and vec_vmrgob(). These operations are used in the implementation of the endian sensitive vec_mrgeb() and vec_mrgob(). These support the OpenPOWER ABI mandated merge even/odd semantic.
We also provide the merge algebraic high/low operations vec_mrgahb() and vec_mrgalb() to simplify extended precision arithmetic. These implementations use vec_vmrgeb() and vec_vmrgob() as extended precision byte order does not change with endian. These operations are used in turn to implement multiply byte high/low/modulo (vec_mulhsb(), vec_mulhub(), vec_mulubm()).
These operations provide a basis for using the multiplicative inverse as an alternative to integer divide.
The performance characteristics of the merge and multiply byte operations are very similar to the halfword implementations (see Performance data).
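As a worked example of the multiplicative inverse idea, the sketch below divides an unsigned byte by 10 using only a multiply high and a shift. The constant 205 (approximately 2^11 / 10) and the `ref_` helper names are illustrative assumptions, not values taken from this header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one element of multiply high unsigned byte. */
static uint8_t ref_mulhi_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a * b) >> 8);
}

/* x / 10 for any 8-bit x: (x * 205) >> 11, split into a multiply high
   (which shifts right by 8) followed by a further shift right by 3. */
static uint8_t ref_div10(uint8_t x)
{
    return (uint8_t)(ref_mulhi_u8(x, 205) >> 3);
}
```

The approximation is exact over the whole 0-255 range, so the vector equivalent would be one multiply high byte followed by one shift right byte immediate.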
High level performance estimates are provided as an aid to function selection when evaluating algorithms. For background on how Latency and Throughput are derived see: Performance data.
Vector Absolute Difference Unsigned byte.
Compute the absolute difference for each byte. For each unsigned byte, subtract B[i] from A[i] and return the absolute value of the difference.
processor | Latency | Throughput |
---|---|---|
power8 | 4 | 1/cycle |
power9 | 3 | 2/cycle |
vra | vector of 16 unsigned bytes |
vrb | vector of 16 unsigned bytes |
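A scalar sketch of the absolute difference semantics for one element follows. Computing it as maximum minus minimum is one common composition when no single instruction is available; whether this header uses that sequence on older processors is an assumption here, and `ref_absdub` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: |a - b| for unsigned bytes, computed as
   max(a, b) - min(a, b) so the subtraction never wraps. */
static uint8_t ref_absdub(uint8_t a, uint8_t b)
{
    uint8_t hi = (a > b) ? a : b;
    uint8_t lo = (a > b) ? b : a;
    return (uint8_t)(hi - lo);
}
```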
Vector Count Leading Zeros Byte for unsigned char (byte) elements.
Count the number of leading '0' bits (0-8) within each byte element of a 128-bit vector.
For POWER8 (PowerISA 2.07B) or later use the Vector Count Leading Zeros Byte instruction vclzb. Otherwise use a sequence of pre 2.07 VMX instructions. SIMDized count leading zeros inspired by:
Warren, Henry S. Jr., Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Figure 5-12.
processor | Latency | Throughput |
---|---|---|
power8 | 2 | 2/cycle |
power9 | 3 | 2/cycle |
vra | 128-bit vector treated as 16 x 8-bit unsigned integer (byte) elements. |
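For reference, the per-element result can be modeled in scalar C with a binary search over the byte, in the spirit of the Hacker's Delight routines. This is a sketch of the semantics only, with the hypothetical name `ref_clzb`; it is not the VMX sequence used by the header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: count leading zero bits of one byte (result 0-8). */
static unsigned ref_clzb(uint8_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 8;
    if ((x & 0xF0) == 0) { n += 4; x = (uint8_t)(x << 4); }
    if ((x & 0xC0) == 0) { n += 2; x = (uint8_t)(x << 2); }
    if ((x & 0x80) == 0) { n += 1; }
    return n;
}
```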
Vector Count Trailing Zeros Byte for unsigned char (byte) elements.
Count the number of trailing '0' bits (0-8) within each byte element of a 128-bit vector.
For POWER9 (PowerISA 3.0B) or later use the Vector Count Trailing Zeros Byte instruction vctzb. Otherwise use a sequence of pre ISA 3.0 VMX instructions. SIMDized count trailing zeros inspired by:
Warren, Henry S. Jr., Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Section 5-4.
processor | Latency | Throughput |
---|---|---|
power8 | 6-8 | 2/cycle |
power9 | 3 | 2/cycle |
vra | 128-bit vector treated as 16 x 8-bit unsigned char (byte) elements. |
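One well-known way to count trailing zeros without a dedicated instruction is to build a mask of the trailing zero bits, `~x & (x - 1)`, and count its one bits. The scalar sketch below shows that identity for a single byte; `ref_ctzb` is a hypothetical name, and the header's actual pre-POWER9 sequence may be composed differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: count trailing zero bits of one byte (result 0-8). */
static unsigned ref_ctzb(uint8_t x)
{
    /* Ones exactly where x has trailing zeros; all ones when x == 0. */
    uint8_t mask = (uint8_t)(~x & (x - 1));
    unsigned n = 0;
    while (mask != 0) {
        n += mask & 1u;
        mask >>= 1;
    }
    return n;
}
```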
Vector isalpha.
Return a vector boolean char with a true indicator for any character that is either Lower Case Alpha ASCII or Upper Case Alpha ASCII. False otherwise.
processor | Latency | Throughput |
---|---|---|
power8 | 10-20 | 1/cycle |
power9 | 11-21 | 1/cycle |
vec_str | vector of 16 ASCII characters |
Vector isalnum.
Return a vector boolean char with a true indicator for any character that is either Lower Case Alpha ASCII, Upper Case Alpha ASCII, or numeric ASCII. False otherwise.
processor | Latency | Throughput |
---|---|---|
power8 | 9-18 | 1/cycle |
power9 | 10-19 | 1/cycle |
vec_str | vector of 16 ASCII characters |
Vector isdigit.
Return a vector boolean char with a true indicator for any character that is an ASCII decimal digit. False otherwise.
processor | Latency | Throughput |
---|---|---|
power8 | 4-13 | 1/cycle |
power9 | 5-14 | 1/cycle |
vec_str | vector of 16 ASCII characters |
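The three character classification operations above share one pattern: range compares that produce an all-ones or all-zeros byte per element, combined with OR. A scalar sketch of the per-element results, using hypothetical `ref_` names and assuming ASCII input, might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical models: 0xFF stands for "true", 0x00 for "false",
   matching the element values of a vector bool char. */
static uint8_t ref_isdigit(uint8_t c)
{
    return (c >= '0' && c <= '9') ? 0xFF : 0x00;
}

static uint8_t ref_isalpha(uint8_t c)
{
    uint8_t lower = (c >= 'a' && c <= 'z') ? 0xFF : 0x00;
    uint8_t upper = (c >= 'A' && c <= 'Z') ? 0xFF : 0x00;
    return (uint8_t)(lower | upper);
}

static uint8_t ref_isalnum(uint8_t c)
{
    return (uint8_t)(ref_isalpha(c) | ref_isdigit(c));
}
```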
Vector Merge Algebraic High Byte operation.
Merge only the high bytes from 16 x Algebraic halfwords across vectors vra and vrb. This is effectively the Vector Merge Even Byte operation that is not modified for Endian.
For example merge the high 8-bits from each of 16 x 16-bit products as generated by vec_muleub/vec_muloub. This result is effectively a vector multiply high unsigned byte.
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned short. |
vrb | 128-bit vector unsigned short. |
Vector Merge Algebraic Low Byte operation.
Merge only the low bytes from 16 x Algebraic halfwords across vectors vra and vrb. This is effectively the Vector Merge Odd Bytes operation that is not modified for Endian.
For example merge the low 8-bits from each of 16 x 16-bit products as generated by vec_muleub/vec_muloub. This result is effectively a vector multiply low unsigned byte.
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned short. |
vrb | 128-bit vector unsigned short. |
Vector Merge Even Bytes operation.
Merge the even byte elements from the concatenation of 2 x vectors (vra and vrb).
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned char. |
vrb | 128-bit vector unsigned char. |
Vector Merge Odd Bytes operation.
Merge the odd byte elements from the concatenation of 2 x vectors (vra and vrb).
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned char. |
vrb | 128-bit vector unsigned char. |
Vector Multiply High Signed Bytes.
Multiply the corresponding byte elements of two vector signed char values and return the high order 8-bits, for each 16-bit product element.
processor | Latency | Throughput |
---|---|---|
power8 | 9-13 | 1/cycle |
power9 | 10-14 | 1/cycle |
vra | 128-bit vector signed char. |
vrb | 128-bit vector signed char. |
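For the signed case, the high byte of the 16-bit product is the product divided by 256 and rounded toward negative infinity. The scalar sketch below models one element with the hypothetical name `ref_mulhsb`. It uses floor division rather than a right shift because shifting a negative value is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: high order 8 bits of the signed 16-bit product. */
static int8_t ref_mulhsb(int8_t a, int8_t b)
{
    int prod = (int)a * (int)b;   /* range -16256 .. 16384 */
    int q = prod / 256;           /* C division truncates toward zero */
    if (prod % 256 < 0)
        q--;                      /* adjust to floor for negative products */
    return (int8_t)q;             /* q is within -64 .. 64 */
}
```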
Vector Multiply High Unsigned Bytes.
Multiply the corresponding byte elements of two vector unsigned char values and return the high order 8-bits, for each 16-bit product element.
processor | Latency | Throughput |
---|---|---|
power8 | 9-13 | 1/cycle |
power9 | 10-14 | 1/cycle |
vra | 128-bit vector unsigned char. |
vrb | 128-bit vector unsigned char. |
Vector Multiply Unsigned Byte Modulo.
Multiply the corresponding byte elements of two vector unsigned char values and return the low order 8-bits of the 16-bit product for each element.
processor | Latency | Throughput |
---|---|---|
power8 | 9-13 | 1/cycle |
power9 | 10-14 | 1/cycle |
vra | 128-bit vector unsigned char. |
vrb | 128-bit vector unsigned char. |
Vector Population Count byte.
Count the number of '1' bits (0-8) within each byte element of a 128-bit vector.
For POWER8 (PowerISA 2.07B) or later use the Vector Population Count Byte instruction. Otherwise use simple Vector (VMX) instructions to count bits in bytes in parallel. SIMDized population count inspired by:
Warren, Henry S. Jr., Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Figure 5-2.
processor | Latency | Throughput |
---|---|---|
power8 | 2 | 2/cycle |
power9 | 3 | 2/cycle |
vra | 128-bit vector treated as 16 x 8-bit integer (byte) elements. |
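Counting bits "in parallel" means summing adjacent bit fields of doubling width. A scalar sketch of that divide-and-conquer scheme for a single byte follows; `ref_popcntb` is a hypothetical name, and the code illustrates the technique rather than reproducing the header's VMX sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: population count of one byte (result 0-8). */
static uint8_t ref_popcntb(uint8_t x)
{
    unsigned v = x;
    v = v - ((v >> 1) & 0x55u);            /* 4 x 2-bit sums */
    v = (v & 0x33u) + ((v >> 2) & 0x33u);  /* 2 x 4-bit sums */
    v = (v + (v >> 4)) & 0x0Fu;            /* 1 x 8-bit sum  */
    return (uint8_t)v;
}
```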
Vector Set Bool from Signed Byte.
For each byte, propagate the sign bit to all 8-bits of that byte. The result is vector bool char reflecting the sign bit of each 8-bit byte.
processor | Latency | Throughput |
---|---|---|
power8 | 2-4 | 2/cycle |
power9 | 2-5 | 2/cycle |
vra | Vector signed char. |
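In scalar terms, each result element is all ones when the input byte is negative and all zeros otherwise. A minimal sketch, with the hypothetical name `ref_setb_sb`:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: propagate the sign bit across the whole byte. */
static uint8_t ref_setb_sb(int8_t a)
{
    return (a < 0) ? 0xFF : 0x00;
}
```

The result can then serve as a select mask, for example to choose between two values per element.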
Shift left double quadword by octet. Return a vector unsigned char that is the left most 16 chars after shifting left 0-15 octets (chars) of the 32 char double vector (vrw||vrx). The octet shift amount is from bits 121:124 of vrb.
This sequence can be used to align an unaligned 16 char substring based on the result of a vector count leading zeros of the compare boolean.
processor | Latency | Throughput |
---|---|---|
power8 | 6-8 | 1/cycle |
power9 | 8-9 | 1/cycle |
vrw | upper 16-bytes of the 32-byte double vector. |
vrx | lower 16-bytes of the 32-byte double vector. |
vrb | Shift amount in bits 121:124. |
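The double quadword shift can be modeled as a 16-byte window sliding over the 32-byte concatenation vrw||vrx. The sketch below assumes big-endian numbering (bit 0 is the most significant bit, byte 0 is the leftmost), so bits 121:124 are bits 3 through 6, counting from the least significant bit, of the last byte of vrb; `ref_shift_leftdo` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: leftmost 16 bytes of (vrw||vrx) shifted left
   by 0-15 octets. */
static void ref_shift_leftdo(uint8_t r[16], const uint8_t vrw[16],
                             const uint8_t vrx[16], const uint8_t vrb[16])
{
    /* Bits 121:124 sit just above the low 3 bits of the last byte. */
    unsigned octets = (vrb[15] >> 3) & 0xFu;
    assert(r != vrw && r != vrx); /* output must not alias the inputs */
    for (int i = 0; i < 16; i++) {
        unsigned s = (unsigned)i + octets;
        r[i] = (s < 16) ? vrw[s] : vrx[s - 16];
    }
}
```

Because the octet count sits above the low 3 bits, a bit count (such as a leading zeros count over a compare result) can be passed in the last byte and its low 3 bits are ignored.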
Vector Shift left Byte Immediate.
Shift left each byte element [0-15], 0-7 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-7. A shift count of 0 returns the original value of vra. Shift counts greater than 7 bits return zero.
processor | Latency | Throughput |
---|---|---|
power8 | 4-11 | 2/cycle |
power9 | 5-11 | 2/cycle |
vra | a 128-bit vector treated as a vector unsigned char. |
shb | Shift amount in the range 0-7. |
Vector Shift Right Algebraic Byte Immediate.
Shift right each byte element [0-15], 0-7 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-7. A shift count of 0 returns the original value of vra. Shift counts greater than 7 bits return the sign bit propagated to each bit of each element.
processor | Latency | Throughput |
---|---|---|
power8 | 4-11 | 2/cycle |
power9 | 5-11 | 2/cycle |
vra | a 128-bit vector treated as a vector signed char. |
shb | Shift amount in the range 0-7. |
Vector Shift Right Byte Immediate.
Shift right each byte element [0-15], 0-7 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-7. A shift count of 0 returns the original value of vra. Shift counts greater than 7 bits return zero.
processor | Latency | Throughput |
---|---|---|
power8 | 4-11 | 2/cycle |
power9 | 5-11 | 2/cycle |
vra | a 128-bit vector treated as a vector unsigned char. |
shb | Shift amount in the range 0-7. |
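The out-of-range behavior of the three shift immediate operations is easiest to see in a scalar model of one element. The `ref_` names are hypothetical. The algebraic variant uses floor division so that the sketch does not rely on the implementation-defined right shift of negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical models of one byte element for each shift immediate. */
static uint8_t ref_slbi(uint8_t a, unsigned shb)
{
    return (shb < 8) ? (uint8_t)(a << shb) : 0;
}

static uint8_t ref_srbi(uint8_t a, unsigned shb)
{
    return (shb < 8) ? (uint8_t)(a >> shb) : 0;
}

static int8_t ref_srabi(int8_t a, unsigned shb)
{
    int d, q;
    if (shb > 7)
        shb = 7;       /* a shift of 7 already fills the byte with the sign */
    d = 1 << shb;
    q = a / d;         /* truncates toward zero */
    if (a % d < 0)
        q--;           /* adjust to floor, matching an arithmetic shift */
    return (int8_t)q;
}
```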
Vector tolower.
Convert any Upper Case Alpha ASCII characters within a vector unsigned char into the equivalent Lower Case character. Return the result as a vector unsigned char.
processor | Latency | Throughput |
---|---|---|
power8 | 8-17 | 1/cycle |
power9 | 9-18 | 1/cycle |
vec_str | vector of 16 ASCII characters |
Vector toupper.
Convert any Lower Case Alpha ASCII characters within a vector unsigned char into the equivalent Upper Case character. Return the result as a vector unsigned char.
processor | Latency | Throughput |
---|---|---|
power8 | 8-17 | 1/cycle |
power9 | 9-18 | 1/cycle |
vec_str | vector of 16 ASCII characters |
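In ASCII, upper and lower case letters differ only in bit 0x20, so both conversions reduce to a range compare that builds a mask, followed by an OR or an AND-with-complement. A scalar sketch of one element, using hypothetical `ref_` names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical models: only letters are changed; every other byte
   passes through unmodified. */
static uint8_t ref_tolower(uint8_t c)
{
    uint8_t mask = (c >= 'A' && c <= 'Z') ? 0x20 : 0x00;
    return (uint8_t)(c | mask);
}

static uint8_t ref_toupper(uint8_t c)
{
    uint8_t mask = (c >= 'a' && c <= 'z') ? 0x20 : 0x00;
    return (uint8_t)(c & ~mask);
}
```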
Vector Merge Even Bytes.
Merge the even byte elements from the concatenation of 2 x vectors (vra and vrb).
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned char. |
vrb | 128-bit vector unsigned char. |
Vector Merge Odd Byte.
Merge the odd byte elements from the concatenation of 2 x vectors (vra and vrb).
processor | Latency | Throughput |
---|---|---|
power8 | 2-13 | 2/cycle |
power9 | 3-14 | 2/cycle |
vra | 128-bit vector unsigned char. |
vrb | 128-bit vector unsigned char. |