POWER Vector Library Manual  1.0.4
vec_char_ppc.h File Reference

Header package containing a collection of 128-bit SIMD operations over 8-bit integer (char) elements.

#include <pveclib/vec_common_ppc.h>

Functions

static vui8_t vec_absdub (vui8_t vra, vui8_t vrb)
 Vector Absolute Difference Unsigned byte.

static vui8_t vec_clzb (vui8_t vra)
 Vector Count Leading Zeros Byte for unsigned char (byte) elements.

static vui8_t vec_ctzb (vui8_t vra)
 Vector Count Trailing Zeros Byte for unsigned char (byte) elements.

static vui8_t vec_isalnum (vui8_t vec_str)
 Vector isalnum.

static vui8_t vec_isalpha (vui8_t vec_str)
 Vector isalpha.

static vui8_t vec_isdigit (vui8_t vec_str)
 Vector isdigit.

static vui8_t vec_mrgahb (vui16_t vra, vui16_t vrb)
 Vector Merge Algebraic High Byte operation.

static vui8_t vec_mrgalb (vui16_t vra, vui16_t vrb)
 Vector Merge Algebraic Low Byte operation.

static vui8_t vec_mrgeb (vui8_t vra, vui8_t vrb)
 Vector Merge Even Bytes operation.

static vui8_t vec_mrgob (vui8_t vra, vui8_t vrb)
 Vector Merge Odd Bytes operation.

static vi8_t vec_mulhsb (vi8_t vra, vi8_t vrb)
 Vector Multiply High Signed Bytes.

static vui8_t vec_mulhub (vui8_t vra, vui8_t vrb)
 Vector Multiply High Unsigned Bytes.

static vui8_t vec_mulubm (vui8_t vra, vui8_t vrb)
 Vector Multiply Unsigned Byte Modulo.

static vui8_t vec_popcntb (vui8_t vra)
 Vector Population Count byte.

static vb8_t vec_setb_sb (vi8_t vra)
 Vector Set Bool from Signed Byte.

static vui8_t vec_slbi (vui8_t vra, const unsigned int shb)
 Vector Shift Left Byte Immediate.

static vi8_t vec_srabi (vi8_t vra, const unsigned int shb)
 Vector Shift Right Algebraic Byte Immediate.

static vui8_t vec_srbi (vui8_t vra, const unsigned int shb)
 Vector Shift Right Byte Immediate.

static vui8_t vec_shift_leftdo (vui8_t vrw, vui8_t vrx, vui8_t vrb)
 Shift left double quadword by octet. Return a vector unsigned char that is the left most 16 chars after shifting left 0-15 octets (chars) of the 32 char double vector (vrw||vrx). The octet shift amount is from bits 121:124 of vrb.

static vui8_t vec_toupper (vui8_t vec_str)
 Vector toupper.

static vui8_t vec_tolower (vui8_t vec_str)
 Vector tolower.

static vui8_t vec_vmrgeb (vui8_t vra, vui8_t vrb)
 Vector Merge Even Bytes.

static vui8_t vec_vmrgob (vui8_t vra, vui8_t vrb)
 Vector Merge Odd Bytes.
 

Detailed Description

Header package containing a collection of 128-bit SIMD operations over 8-bit integer (char) elements.

Most of these operations are implemented in a single VMX or VSX instruction on newer (POWER6/POWER7/POWER8/POWER9) processors. This header serves to fill in functional gaps for older (POWER7, POWER8) processors and provides in-line assembler implementations for older compilers that do not provide the built-ins.

Most vector char (8-bit integer) operations are already covered by the original VMX (AKA Altivec) instructions. VMX intrinsics (compiler built-ins) are defined in <altivec.h> and described in the compiler documentation. PowerISA 2.07B (POWER8) added several useful byte operations (count leading zeros, population count) not included in the original VMX. PowerISA 3.0B (POWER9) adds several more (absolute difference, compare not equal, count trailing zeros, extend sign, extract/insert, and reverse bytes). Most of these intrinsics (compiler built-ins) are also defined in <altivec.h> and described in the compiler documentation.

Note
The compiler disables the associated <altivec.h> built-ins if the -mcpu target does not enable the specific instruction. For example, if you compile with -mcpu=power7, vec_vclz and vec_vclzb will not be defined. But vec_clzb is always defined in this header, will generate the minimum code appropriate for the target, and will produce correct results.

This header covers operations that are either:

Endian problems with byte operations

It would be useful to provide a vector multiply high byte (return the high order 8-bits of the 16-bit product) operation. This can be used for multiplicative inverse (effectively integer divide) operations. Neither integer multiply high nor divide are available as vector instructions. However the multiply high byte operation can be composed from the existing multiply even/odd byte operations followed by the vector merge even byte operation. Similarly a multiply low (modulo) byte operation can be composed from the existing multiply even/odd byte operations followed by the vector merge odd byte operation.

As a prerequisite we need to provide the merge even/odd byte operations. While PowerISA has added these operations for word and doubleword, instructions are not defined for byte and halfword. Fortunately vector merge operations are just a special case of vector permute. So the vec_vmrgob() and vec_vmrgeb() implementations can use vec_perm and appropriate selection vectors to provide these merge operations.

As described for other element sizes this is complicated by little-endian (LE) support as specified in the OpenPOWER ABI and as implemented in the compilers. Little-endian changes the effective vector element numbering and the location of even and odd elements. This means that the vector built-ins provided by altivec.h may not generate the instructions you would expect.

See also
Endian problems with halfword operations
General Endian Issues

So this header defines endian independent byte operations vec_vmrgeb() and vec_vmrgob(). These operations are used in the implementation of the endian sensitive vec_mrgeb() and vec_mrgob(). These support the OpenPOWER ABI mandated merge even/odd semantic.

We also provide the merge algebraic high/low operations vec_mrgahb() and vec_mrgalb() to simplify extended precision arithmetic. These implementations use vec_vmrgeb() and vec_vmrgob() as extended precision byte order does not change with endian. These operations are used in turn to implement multiply byte high/low/modulo (vec_mulhsb(), vec_mulhub(), vec_mulubm()).
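The composition described above (multiply even/odd, then merge the high bytes) can be sketched with a portable scalar model. This is not pveclib code: the names model_mul_eo and model_mulhub are hypothetical, plain arrays stand in for vector registers, and big-endian element numbering is assumed.

```c
#include <assert.h>
#include <stdint.h>

enum { NBYTES = 16, NHALF = 8 };

/* Multiply the even (odd = 0) or odd (odd = 1) byte elements,
   giving 8 halfword products. Hypothetical model, not library code. */
static void model_mul_eo(const uint8_t a[NBYTES], const uint8_t b[NBYTES],
                         int odd, uint16_t prod[NHALF])
{
    assert(odd == 0 || odd == 1);
    for (int i = 0; i < NHALF; i++)
        prod[i] = (uint16_t)(a[2 * i + odd] * b[2 * i + odd]);
}

/* Multiply high: interleave the high bytes of the even and odd products
   so each result byte lands in the lane of its source operands. */
static void model_mulhub(const uint8_t a[NBYTES], const uint8_t b[NBYTES],
                         uint8_t res[NBYTES])
{
    uint16_t even[NHALF], odd[NHALF];
    model_mul_eo(a, b, 0, even);
    model_mul_eo(a, b, 1, odd);
    for (int i = 0; i < NHALF; i++) {
        res[2 * i]     = (uint8_t)(even[i] >> 8);
        res[2 * i + 1] = (uint8_t)(odd[i] >> 8);
    }
}
```

Taking the low byte of each product instead of the high byte gives the corresponding multiply low (modulo) model.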

These operations provide a basis for using the multiplicative inverse as an alternative to integer divide.

See also
Examples, Divide by integer constant
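As a small illustration of the multiplicative inverse idea, the scalar sketch below divides a byte by 10 using a multiply high followed by a shift. The constant 205 and the shift of 3 are illustrative choices (205/2048 approximates 1/10), and the function names are hypothetical; none of this is taken from the library.

```c
#include <assert.h>
#include <stdint.h>

/* High 8 bits of the 16-bit product: one lane of a multiply high. */
static uint8_t model_mulhub_byte(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a * b) >> 8);
}

/* x / 10 via the reciprocal 205 / 2^11: multiply high gives
   (x * 205) >> 8, and the remaining shift by 3 completes >> 11. */
static uint8_t model_div10_byte(uint8_t x)
{
    return (uint8_t)(model_mulhub_byte(x, 205) >> 3);
}

/* Exhaustive check that the reciprocal is exact for every byte value. */
static void model_div10_check(void)
{
    for (unsigned x = 0; x <= 255; x++)
        assert(model_div10_byte((uint8_t)x) == x / 10);
}
```

Because a byte has only 256 values, an exhaustive check like model_div10_check is a cheap way to validate any chosen reciprocal and shift.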

Performance data.

The performance characteristics of the merge and multiply byte operations are very similar to the halfword implementations (see Performance data).

More information.

High level performance estimates are provided as an aid to function selection when evaluating algorithms. For background on how Latency and Throughput are derived see: Performance data.

Function Documentation

◆ vec_absdub()

static vui8_t vec_absdub (vui8_t vra, vui8_t vrb)
inline static

Vector Absolute Difference Unsigned byte.

Compute the absolute difference for each byte. For each unsigned byte, subtract B[i] from A[i] and return the absolute value of the difference.

processor Latency Throughput
power8 4 1/cycle
power9 3 2/cycle
Parameters
vra  vector of 16 unsigned bytes
vrb  vector of 16 unsigned bytes
Returns
vector of the absolute difference.
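The per-byte result can be modeled in scalar C as the larger operand minus the smaller, which avoids any wrap-around. This is a sketch of the semantics only; model_absdub_byte is a hypothetical name and says nothing about how the library implements the operation.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of absolute difference: max(a, b) - min(a, b). */
static uint8_t model_absdub_byte(uint8_t a, uint8_t b)
{
    uint8_t hi = a > b ? a : b;
    uint8_t lo = a > b ? b : a;
    assert(hi >= lo);            /* so the subtraction cannot wrap */
    return (uint8_t)(hi - lo);
}
```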

◆ vec_clzb()

static vui8_t vec_clzb (vui8_t vra)
inline static

Vector Count Leading Zeros Byte for unsigned char (byte) elements.

Count the number of leading '0' bits (0-8) within each byte element of a 128-bit vector.

For POWER8 (PowerISA 2.07B) or later use the Vector Count Leading Zeros Byte instruction vclzb. Otherwise use a sequence of pre 2.07 VMX instructions. SIMDized count leading zeros inspired by:

Warren, Henry S. Jr. Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Figure 5-12.

processor Latency Throughput
power8 2 2/cycle
power9 3 2/cycle
Parameters
vra  128-bit vector treated as 16 x 8-bit unsigned integer (byte) elements.
Returns
128-bit vector with the leading zeros count for each byte element.
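For reference, one lane of the operation can be modeled with a branching binary search over the byte. This is a scalar sketch under the assumption that a zero byte yields a count of 8; model_clzb_byte is a hypothetical name and this is not the SIMD sequence the header uses.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zero bits of one byte by halving the search range. */
static unsigned model_clzb_byte(uint8_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 8;                               /* all eight bits are zero */
    if ((x & 0xF0) == 0) { n += 4; x = (uint8_t)(x << 4); }
    if ((x & 0xC0) == 0) { n += 2; x = (uint8_t)(x << 2); }
    if ((x & 0x80) == 0) { n += 1; }
    assert(n <= 7);                             /* nonzero input: at most 7 */
    return n;
}
```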

◆ vec_ctzb()

static vui8_t vec_ctzb (vui8_t vra)
inline static

Vector Count Trailing Zeros Byte for unsigned char (byte) elements.

Count the number of trailing '0' bits (0-8) within each byte element of a 128-bit vector.

For POWER9 (PowerISA 3.0B) or later use the Vector Count Trailing Zeros Byte instruction vctzb. Otherwise use a sequence of pre ISA 3.0 VMX instructions. SIMDized count trailing zeros inspired by:

Warren, Henry S. Jr. Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Section 5-4.

processor Latency Throughput
power8 6-8 2/cycle
power9 3 2/cycle
Parameters
vra  128-bit vector treated as 16 x 8-bit unsigned char (byte) elements.
Returns
128-bit vector with the trailing zeros count for each byte element.
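A common way to count trailing zeros without a dedicated instruction is to form a mask of the bits below the lowest set bit and count the ones in it. The scalar sketch below shows that idea for one byte; model_ctzb_byte is a hypothetical name and the library's actual instruction sequence may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Count trailing zero bits of one byte. */
static unsigned model_ctzb_byte(uint8_t x)
{
    /* Ones exactly where x has trailing zeros; 0xFF when x is 0. */
    uint8_t mask = (uint8_t)(~x & (x - 1));
    unsigned n = 0;
    while (mask != 0) {          /* the mask is contiguous from bit 0 up */
        n++;
        mask >>= 1;
    }
    assert(n <= 8);
    return n;
}
```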

◆ vec_isalnum()

static vui8_t vec_isalnum (vui8_t vec_str)
inline static

Vector isalnum.

Return a vector boolean char with a true indicator for any character that is either Lower Case Alpha ASCII, Upper Case ASCII, or numeric ASCII. False otherwise.

processor Latency Throughput
power8 10-20 1/cycle
power9 11-21 1/cycle
Parameters
vec_str  vector of 16 ASCII characters
Returns
vector bool char of the isalnum operation applied to each character of vec_str. For each byte 0xff indicates true (isalnum), 0x00 indicates false.

◆ vec_isalpha()

static vui8_t vec_isalpha (vui8_t vec_str)
inline static

Vector isalpha.

Return a vector boolean char with a true indicator for any character that is either Lower Case Alpha ASCII or Upper Case ASCII. False otherwise.

processor Latency Throughput
power8 9-18 1/cycle
power9 10-19 1/cycle
Parameters
vec_str  vector of 16 ASCII characters
Returns
vector bool char of the isalpha operation applied to each character of vec_str. For each byte 0xff indicates true (isalpha), 0x00 indicates false.

◆ vec_isdigit()

static vui8_t vec_isdigit (vui8_t vec_str)
inline static

Vector isdigit.

Return a vector boolean char with a true indicator for any character that is ASCII decimal digit. False otherwise.

processor Latency Throughput
power8 4-13 1/cycle
power9 5-14 1/cycle
Parameters
vec_str  vector of 16 ASCII characters
Returns
vector bool char of the isdigit operation applied to each character of vec_str. For each byte 0xff indicates true (isdigit), 0x00 indicates false.
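The three classification operations share one pattern: range compares that produce an all-ones or all-zeros byte, combined with OR. The scalar sketch below models one lane of each, assuming an ASCII execution character set; the model_ names are hypothetical and this is not the library's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* 0xFF if lo <= c <= hi, else 0x00 (the per-byte bool char convention). */
static uint8_t model_in_range(uint8_t c, uint8_t lo, uint8_t hi)
{
    assert(lo <= hi);
    return (c >= lo && c <= hi) ? 0xFF : 0x00;
}

static uint8_t model_isdigit_byte(uint8_t c)
{
    return model_in_range(c, '0', '9');
}

static uint8_t model_isalpha_byte(uint8_t c)
{
    return (uint8_t)(model_in_range(c, 'a', 'z') | model_in_range(c, 'A', 'Z'));
}

static uint8_t model_isalnum_byte(uint8_t c)
{
    return (uint8_t)(model_isalpha_byte(c) | model_isdigit_byte(c));
}
```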

◆ vec_mrgahb()

static vui8_t vec_mrgahb (vui16_t vra, vui16_t vrb)
inline static

Vector Merge Algebraic High Byte operation.

Merge only the high bytes from 16 x Algebraic halfwords across vectors vra and vrb. This is effectively the Vector Merge Even Bytes operation that is not modified for Endian.

For example merge the high 8-bits from each of 16 x 16-bit products as generated by vec_muleub/vec_muloub. This result is effectively a vector multiply high unsigned byte.

processor Latency Throughput
power8 2-13 2/cycle
power9 3-14 2/cycle
Parameters
vra  128-bit vector unsigned short.
vrb  128-bit vector unsigned short.
Returns
A vector merge from only the high bytes of the 16 x Algebraic halfwords across vra and vrb.

◆ vec_mrgalb()

static vui8_t vec_mrgalb (vui16_t vra, vui16_t vrb)
inline static

Vector Merge Algebraic Low Byte operation.

Merge only the low bytes from 16 x Algebraic halfwords across vectors vra and vrb. This is effectively the Vector Merge Odd Bytes operation that is not modified for Endian.

For example merge the low 8-bits from each of 16 x 16-bit products as generated by vec_muleub/vec_muloub. This result is effectively a vector multiply low unsigned byte.

processor Latency Throughput
power8 2-13 2/cycle
power9 3-14 2/cycle
Parameters
vra  128-bit vector unsigned short.
vrb  128-bit vector unsigned short.
Returns
A vector merge from only the low bytes of the 16 x Algebraic halfwords across vra and vrb.

◆ vec_mrgeb()

static vui8_t vec_mrgeb (vui8_t vra, vui8_t vrb)
inline static

Vector Merge Even Bytes operation.

Merge the even byte elements from the concatenation of 2 x vectors (vra and vrb).

Note
The element numbering changes between Big and Little Endian. So the compiler and this implementation adjust the generated code to reflect this.

processor Latency Throughput
power8 2-13 2/cycle
power9 3-14 2/cycle
Parameters
vra  128-bit vector unsigned char.
vrb  128-bit vector unsigned char.
Returns
A vector merge from only the even bytes of vra and vrb.

◆ vec_mrgob()

static vui8_t vec_mrgob (vui8_t vra, vui8_t vrb)
inline static

Vector Merge Odd Bytes operation.

Merge the odd byte elements from the concatenation of 2 x vectors (vra and vrb).

Note
The element numbering changes between Big and Little Endian. So the compiler and this implementation adjust the generated code to reflect this.

processor Latency Throughput
power8 2-13 2/cycle
power9 3-14 2/cycle
Parameters
vra  128-bit vector unsigned char.
vrb  128-bit vector unsigned char.
Returns
A vector merge from only the odd bytes of vra and vrb.

◆ vec_mulhsb()

static vi8_t vec_mulhsb (vi8_t vra, vi8_t vrb)
inline static

Vector Multiply High Signed Bytes.

Multiply the corresponding byte elements of two vector signed char values and return the high order 8-bits, for each 16-bit product element.

processor Latency Throughput
power8 9-13 1/cycle
power9 10-14 1/cycle
Parameters
vra  128-bit vector signed char.
vrb  128-bit vector signed char.
Returns
vector of the high order 8-bits of the product of the byte elements from vra and vrb.

◆ vec_mulhub()

static vui8_t vec_mulhub (vui8_t vra, vui8_t vrb)
inline static

Vector Multiply High Unsigned Bytes.

Multiply the corresponding byte elements of two vector unsigned char values and return the high order 8-bits, for each 16-bit product element.

processor Latency Throughput
power8 9-13 1/cycle
power9 10-14 1/cycle
Parameters
vra  128-bit vector unsigned char.
vrb  128-bit vector unsigned char.
Returns
vector of the high order 8-bits of the product of the byte elements from vra and vrb.

◆ vec_mulubm()

static vui8_t vec_mulubm (vui8_t vra, vui8_t vrb)
inline static

Vector Multiply Unsigned Byte Modulo.

Multiply the corresponding byte elements of two vector unsigned char values and return the low order 8-bits of the 16-bit product for each element.

Note
vec_mulubm can be used for unsigned or signed char integers. It is the vector equivalent of Multiply Low Byte.
processor Latency Throughput
power8 9-13 1/cycle
power9 10-14 1/cycle
Parameters
vra  128-bit vector unsigned char.
vrb  128-bit vector unsigned char.
Returns
vector of the low order 8-bits of the unsigned product of the byte elements from vra and vrb.
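The note that the modulo multiply serves both signed and unsigned operands follows from the low 8 bits of a product being independent of how the operands' sign bits are interpreted. The scalar sketch below checks that property for one lane; both function names are hypothetical and not part of the library.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of multiply modulo: low 8 bits of the unsigned product. */
static uint8_t model_mulubm_byte(uint8_t a, uint8_t b)
{
    return (uint8_t)((unsigned)a * b);
}

/* Signed multiply low, reusing the unsigned operation on the same bits. */
static int8_t model_mulsbm_byte(int8_t a, int8_t b)
{
    uint8_t lo = model_mulubm_byte((uint8_t)a, (uint8_t)b);
    /* The signed product has the same low 8 bits as the unsigned one. */
    assert(lo == (uint8_t)(a * b));
    /* Map 128..255 back to -128..-1 without relying on narrowing rules. */
    return (int8_t)(lo < 128 ? lo : lo - 256);
}
```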

◆ vec_popcntb()

static vui8_t vec_popcntb (vui8_t vra)
inline static

Vector Population Count byte.

Count the number of '1' bits (0-8) within each byte element of a 128-bit vector.

For POWER8 (PowerISA 2.07B) or later use the Vector Population Count Byte instruction. Otherwise use simple Vector (VMX) instructions to count bits in bytes in parallel. SIMDized population count inspired by:

Warren, Henry S. Jr. Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Figure 5-2.

processor Latency Throughput
power8 2 2/cycle
power9 3 2/cycle
Parameters
vra  128-bit vector treated as 16 x 8-bit integer (byte) elements.
Returns
128-bit vector with the population count for each byte element.
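The parallel bit-counting idea can be shown on a single byte: sum adjacent bits into 2-bit fields, then 4-bit fields, then the whole byte. This scalar sketch follows the well-known divide-and-conquer pattern; model_popcntb_byte is a hypothetical name and the header's pre-POWER8 sequence may be organized differently.

```c
#include <assert.h>
#include <stdint.h>

/* Population count of one byte using shift, mask, and add steps only. */
static unsigned model_popcntb_byte(uint8_t x)
{
    x = (uint8_t)(x - ((x >> 1) & 0x55));            /* 2-bit field sums */
    x = (uint8_t)((x & 0x33) + ((x >> 2) & 0x33));   /* 4-bit field sums */
    x = (uint8_t)((x + (x >> 4)) & 0x0F);            /* whole-byte sum   */
    assert(x <= 8);
    return x;
}
```

Each step uses only operations that have byte-wise vector equivalents, which is why the pattern carries over to 16 lanes at once.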

◆ vec_setb_sb()

static vb8_t vec_setb_sb (vi8_t vra)
inline static

Vector Set Bool from Signed Byte.

For each byte, propagate the sign bit to all 8-bits of that byte. The result is vector bool char reflecting the sign bit of each 8-bit byte.

processor Latency Throughput
power8 2-4 2/cycle
power9 2-5 2/cycle
Parameters
vra  Vector signed char.
Returns
vector bool char reflecting the sign bit of each byte.

◆ vec_shift_leftdo()

static vui8_t vec_shift_leftdo (vui8_t vrw, vui8_t vrx, vui8_t vrb)
inline static

Shift left double quadword by octet. Return a vector unsigned char that is the left most 16 chars after shifting left 0-15 octets (chars) of the 32 char double vector (vrw||vrx). The octet shift amount is from bits 121:124 of vrb.

This sequence can be used to align an unaligned 16 char substring based on the result of a vector count leading zeros of the compare boolean.

processor Latency Throughput
power8 6-8 1/cycle
power9 8-9 1/cycle
Parameters
vrw  upper 16-bytes of the 32-byte double vector.
vrx  lower 16-bytes of the 32-byte double vector.
vrb  Shift amount in bits 121:124.
Returns
upper 16-bytes of left shifted double vector.
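The operation can be pictured as sliding a 16-byte window over the 32-byte concatenation. The scalar sketch below assumes big-endian bit and byte numbering, so that bits 121:124 of vrb are bits 1 through 4 (counting from the most significant bit) of its last byte; model_shift_leftdo is a hypothetical name, not library code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the left-most 16 bytes of (w || x) after shifting left n octets. */
static void model_shift_leftdo(const uint8_t w[16], const uint8_t x[16],
                               const uint8_t b[16], uint8_t res[16])
{
    /* Assumed mapping of bits 121:124 onto byte 15 (big-endian numbering). */
    unsigned n = (b[15] >> 3) & 0x0F;
    uint8_t both[32];

    assert(n <= 15);
    memcpy(both, w, 16);
    memcpy(both + 16, x, 16);
    memcpy(res, both + n, 16);   /* window starting n bytes in */
}
```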

◆ vec_slbi()

static vui8_t vec_slbi (vui8_t vra, const unsigned int shb)
inline static

Vector Shift Left Byte Immediate.

Shift left each byte element [0-15], 0-7 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-7. A shift count of 0 returns the original value of vra. Shift counts greater than 7 bits return zero.

processor Latency Throughput
power8 4-11 2/cycle
power9 5-11 2/cycle
Parameters
vra  a 128-bit vector treated as a vector unsigned char.
shb  Shift amount in the range 0-7.
Returns
128-bit vector unsigned char, shifted left shb bits.

◆ vec_srabi()

static vi8_t vec_srabi (vi8_t vra, const unsigned int shb)
inline static

Vector Shift Right Algebraic Byte Immediate.

Shift right each byte element [0-15], 0-7 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-7. A shift count of 0 returns the original value of vra. Shift counts greater than 7 bits return the sign bit propagated to each bit of each element.

processor Latency Throughput
power8 4-11 2/cycle
power9 5-11 2/cycle
Parameters
vra  a 128-bit vector treated as a vector signed char.
shb  Shift amount in the range 0-7.
Returns
128-bit vector signed char, shifted right shb bits.

◆ vec_srbi()

static vui8_t vec_srbi (vui8_t vra, const unsigned int shb)
inline static

Vector Shift Right Byte Immediate.

Shift right each byte element [0-15], 0-7 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-7. A shift count of 0 returns the original value of vra. Shift counts greater than 7 bits return zero.

processor Latency Throughput
power8 4-11 2/cycle
power9 5-11 2/cycle
Parameters
vra  a 128-bit vector treated as a vector unsigned char.
shb  Shift amount in the range 0-7.
Returns
128-bit vector unsigned char, shifted right shb bits.
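The out-of-range behavior of the three shift immediate operations is the part most easily misread, so the scalar sketch below models one lane of each, including counts greater than 7. The model_ names are hypothetical and the code only mirrors the documented results, not the library's instruction selection.

```c
#include <assert.h>
#include <stdint.h>

/* Shift left: counts above 7 shift every bit out. */
static uint8_t model_slbi_byte(uint8_t x, unsigned shb)
{
    return shb > 7 ? 0 : (uint8_t)(x << shb);
}

/* Logical shift right: counts above 7 shift every bit out. */
static uint8_t model_srbi_byte(uint8_t x, unsigned shb)
{
    return shb > 7 ? 0 : (uint8_t)(x >> shb);
}

/* Algebraic shift right: counts above 7 leave only copies of the sign bit. */
static int8_t model_srabi_byte(int8_t x, unsigned shb)
{
    unsigned n = shb > 7 ? 7 : shb;
    assert(n <= 7);
    /* For negative x, shift the complement so only non-negative
       values are shifted, then complement back. */
    return (int8_t)(x < 0 ? ~(~x >> n) : x >> n);
}
```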

◆ vec_tolower()

static vui8_t vec_tolower (vui8_t vec_str)
inline static

Vector tolower.

Convert any Upper Case Alpha ASCII characters within a vector unsigned char into the equivalent Lower Case character. Return the result as a vector unsigned char.

processor Latency Throughput
power8 8-17 1/cycle
power9 9-18 1/cycle
Parameters
vec_str  vector of 16 ASCII characters
Returns
vector char converted to lower case.

◆ vec_toupper()

static vui8_t vec_toupper (vui8_t vec_str)
inline static

Vector toupper.

Convert any Lower Case Alpha ASCII characters within a vector unsigned char into the equivalent Upper Case character. Return the result as a vector unsigned char.

processor Latency Throughput
power8 8-17 1/cycle
power9 9-18 1/cycle
Parameters
vec_str  vector of 16 ASCII characters
Returns
vector char converted to upper case.
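In ASCII the two cases differ only in bit 0x20, so case conversion reduces to a range compare that builds a mask, followed by setting or clearing that one bit. The scalar sketch below models one lane of each conversion; the model_ names are hypothetical, ASCII is assumed, and the library may structure the compare differently.

```c
#include <assert.h>
#include <stdint.h>

/* 0xFF if lo <= c <= hi, else 0x00. */
static uint8_t model_range_mask(uint8_t c, uint8_t lo, uint8_t hi)
{
    assert(lo <= hi);
    return (c >= lo && c <= hi) ? 0xFF : 0x00;
}

/* Clear the case bit only for 'a'..'z'; other bytes pass through. */
static uint8_t model_toupper_byte(uint8_t c)
{
    uint8_t is_lower = model_range_mask(c, 'a', 'z');
    return (uint8_t)(c & ~(is_lower & 0x20));
}

/* Set the case bit only for 'A'..'Z'; other bytes pass through. */
static uint8_t model_tolower_byte(uint8_t c)
{
    uint8_t is_upper = model_range_mask(c, 'A', 'Z');
    return (uint8_t)(c | (is_upper & 0x20));
}
```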

◆ vec_vmrgeb()

static vui8_t vec_vmrgeb (vui8_t vra, vui8_t vrb)
inline static

Vector Merge Even Bytes.

Merge the even byte elements from the concatenation of 2 x vectors (vra and vrb).

Note
This function implements the operation of a Vector Merge Even Bytes instruction, if the PowerISA included such an instruction. This implementation is NOT Endian sensitive and the function is stable across BE/LE implementations. Using Big Endian element numbering:
  • res[0] = vra[0];
  • res[1] = vrb[0];
  • res[2] = vra[2];
  • res[3] = vrb[2];
  • res[4] = vra[4];
  • res[5] = vrb[4];
  • res[6] = vra[6];
  • res[7] = vrb[6];
  • res[8] = vra[8];
  • res[9] = vrb[8];
  • res[10] = vra[10];
  • res[11] = vrb[10];
  • res[12] = vra[12];
  • res[13] = vrb[12];
  • res[14] = vra[14];
  • res[15] = vrb[14];
processor Latency Throughput
power8 2-13 2/cycle
power9 3-14 2/cycle
Parameters
vra  128-bit vector unsigned char.
vrb  128-bit vector unsigned char.
Returns
A vector merge from only the even bytes of vra and vrb.

◆ vec_vmrgob()

static vui8_t vec_vmrgob (vui8_t vra, vui8_t vrb)
inline static

Vector Merge Odd Bytes.

Merge the odd byte elements from the concatenation of 2 x vectors (vra and vrb).

Note
This function implements the operation of a Vector Merge Odd Bytes instruction, if the PowerISA included such an instruction. This implementation is NOT Endian sensitive and the function is stable across BE/LE implementations. Using Big Endian element numbering:
  • res[0] = vra[1];
  • res[1] = vrb[1];
  • res[2] = vra[3];
  • res[3] = vrb[3];
  • res[4] = vra[5];
  • res[5] = vrb[5];
  • res[6] = vra[7];
  • res[7] = vrb[7];
  • res[8] = vra[9];
  • res[9] = vrb[9];
  • res[10] = vra[11];
  • res[11] = vrb[11];
  • res[12] = vra[13];
  • res[13] = vrb[13];
  • res[14] = vra[15];
  • res[15] = vrb[15];
processor Latency Throughput
power8 2-13 2/cycle
power9 3-14 2/cycle
Parameters
vra  128-bit vector unsigned char.
vrb  128-bit vector unsigned char.
Returns
A vector merge from only the odd bytes of vra and vrb.
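The element listings for vec_vmrgeb() and vec_vmrgob() reduce to one loop with a parity offset. The scalar sketch below uses the same big-endian element numbering as the listings; model_vmrg_bytes is a hypothetical helper, and the header itself is described as using vec_perm with selection vectors rather than a loop.

```c
#include <assert.h>
#include <stdint.h>

/* Merge even (odd = 0) or odd (odd = 1) bytes of a and b:
   res[2i] = a[2i + odd], res[2i + 1] = b[2i + odd]. */
static void model_vmrg_bytes(const uint8_t a[16], const uint8_t b[16],
                             int odd, uint8_t res[16])
{
    assert(odd == 0 || odd == 1);
    for (int i = 0; i < 8; i++) {
        res[2 * i]     = a[2 * i + odd];
        res[2 * i + 1] = b[2 * i + odd];
    }
}
```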