Header package containing a collection of 128-bit SIMD operations over 8-bit integer (char) elements. More...

#include <pveclib/vec_common_ppc.h>

Functions
static vui8_t	vec_absdub (vui8_t vra, vui8_t vrb)
	Vector Absolute Difference Unsigned byte. More...

static vui8_t	vec_clzb (vui8_t vra)
	Vector Count Leading Zeros Byte for unsigned char (byte) elements. More...

static vui8_t	vec_ctzb (vui8_t vra)
	Vector Count Trailing Zeros Byte for unsigned char (byte) elements. More...

static vui8_t	vec_isalnum (vui8_t vec_str)
	Vector isalnum. More...

static vui8_t	vec_isalpha (vui8_t vec_str)
	Vector isalpha. More...

static vui8_t	vec_isdigit (vui8_t vec_str)
	Vector isdigit. More...

static vui8_t	vec_mrgahb (vui16_t vra, vui16_t vrb)
	Vector Merge Algebraic High Byte operation. More...

static vui8_t	vec_mrgalb (vui16_t vra, vui16_t vrb)
	Vector Merge Algebraic Low Byte operation. More...

static vui8_t	vec_mrgeb (vui8_t vra, vui8_t vrb)
	Vector Merge Even Bytes operation. More...

static vui8_t	vec_mrgob (vui8_t vra, vui8_t vrb)
	Vector Merge Odd Bytes operation. More...

static vi8_t	vec_mulhsb (vi8_t vra, vi8_t vrb)
	Vector Multiply High Signed Bytes. More...

static vui8_t	vec_mulhub (vui8_t vra, vui8_t vrb)
	Vector Multiply High Unsigned Bytes. More...

static vui8_t	vec_mulubm (vui8_t vra, vui8_t vrb)
	Vector Multiply Unsigned Byte Modulo. More...

static vui8_t	vec_popcntb (vui8_t vra)
	Vector Population Count byte. More...

static vb8_t	vec_setb_sb (vi8_t vra)
	Vector Set Bool from Signed Byte. More...

static vui8_t	vec_slbi (vui8_t vra, const unsigned int shb)
	Vector Shift Left Byte Immediate. More...

static vi8_t	vec_srabi (vi8_t vra, const unsigned int shb)
	Vector Shift Right Algebraic Byte Immediate. More...

static vui8_t	vec_srbi (vui8_t vra, const unsigned int shb)
	Vector Shift Right Byte Immediate. More...

static vui8_t	vec_shift_leftdo (vui8_t vrw, vui8_t vrx, vui8_t vrb)
	Shift left double quadword by octet. Return a vector unsigned char that is the leftmost 16 chars after shifting left 0-15 octets (chars) of the 32 char double vector (vrw||vrx). The octet shift amount is from bits 121:124 of vrb. More...

static vui8_t	vec_toupper (vui8_t vec_str)
	Vector toupper. More...

static vui8_t	vec_tolower (vui8_t vec_str)
	Vector tolower. More...

static vui8_t	vec_vmrgeb (vui8_t vra, vui8_t vrb)
	Vector Merge Even Bytes. More...

static vui8_t	vec_vmrgob (vui8_t vra, vui8_t vrb)
	Vector Merge Odd Bytes. More...

Detailed Description

Header package containing a collection of 128-bit SIMD operations over 8-bit integer (char) elements.

Most of these operations are implemented in a single VMX or VSX instruction on newer (POWER6/POWER7/POWER8/POWER9) processors. This header serves to fill in functional gaps for older (POWER7, POWER8) processors and provides in-line assembler implementations for older compilers that do not provide the built-ins.

Most vector char (8-bit integer) operations are already covered by the original VMX (AKA Altivec) instructions. VMX intrinsic (compiler built-ins) operations are defined in <altivec.h> and described in the compiler documentation. PowerISA 2.07B (POWER8) added several useful byte operations (count leading zeros, population count) not included in the original VMX. PowerISA 3.0B (POWER9) adds several more (absolute difference, compare not equal, count trailing zeros, extend sign, extract/insert, and reverse bytes). Most of these intrinsic (compiler built-ins) operations are defined in <altivec.h> and described in the compiler documentation.

Note: The compiler disables associated <altivec.h> built-ins if the mcpu target does not enable the specific instruction. For example if you compile with -mcpu=power7, vec_vclz and vec_vclzb will not be defined. But vec_clzb is always defined in this header, will generate the minimum code, appropriate for the target, and produce correct results.

This header covers operations that are either:

Implemented in later processors and useful to programmers if the same operations are available on slightly older processors. This is required even if the operation is defined in the OpenPOWER ABI or <altivec.h>, as the compiler disables the associated built-ins if the mcpu target does not enable the instruction.
Defined in the OpenPOWER ABI but not yet defined in <altivec.h> provided by available compilers in common use. Examples include Count Leading Zeros and Population Count.
Commonly used operations that are not covered by the ABI or <altivec.h> and either require multiple instructions or are not obvious. Examples include the multiply high, ASCII character tests, and shift immediate operations.

Endian problems with byte operations

It would be useful to provide a vector multiply high byte (return the high order 8-bits of the 16-bit product) operation. This can be used for multiplicative inverse (effectively integer divide) operations. Neither integer multiply high nor divide are available as vector instructions. However the multiply high byte operation can be composed from the existing multiply even/odd byte operations followed by the vector merge even byte operation. Similarly a multiply low (modulo) byte operation can be composed from the existing multiply even/odd byte operations followed by the vector merge odd byte operation.

As a prerequisite we need to provide the merge even/odd byte operations. While PowerISA has added these operations for word and doubleword, instructions are not defined for byte and halfword. Fortunately vector merge operations are just a special case of vector permute. So the vec_vmrgob() and vec_vmrgeb() implementation can use vec_perm and appropriate selection vectors to provide these merge operations.
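The idea that a merge is a special case of permute can be sketched with a portable scalar model. This is only an illustration under stated assumptions: `ref_perm`, `ref_even_ctl` and `ref_odd_ctl` are hypothetical names, the model uses big-endian element numbering, and it is not the pveclib implementation.

```c
#include <stdint.h>

/* Scalar model of a 16-byte permute, big-endian element numbering:
   control values 0-15 select from a[], 16-31 select from b[]. */
static void ref_perm(const uint8_t a[16], const uint8_t b[16],
                     const uint8_t ctl[16], uint8_t res[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t sel = ctl[i] & 0x1f;  /* only the low 5 bits are used */
        res[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}

/* Hypothetical selection vectors: even bytes of a and b interleaved ... */
static const uint8_t ref_even_ctl[16] =
    { 0, 16, 2, 18, 4, 20, 6, 22, 8, 24, 10, 26, 12, 28, 14, 30 };
/* ... and odd bytes of a and b interleaved. */
static const uint8_t ref_odd_ctl[16] =
    { 1, 17, 3, 19, 5, 21, 7, 23, 9, 25, 11, 27, 13, 29, 15, 31 };
```

Calling `ref_perm` with `ref_even_ctl` yields the merge even pattern, and with `ref_odd_ctl` the merge odd pattern.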

As described for other element sizes this is complicated by little-endian (LE) support as specified in the OpenPOWER ABI and as implemented in the compilers. Little-endian changes the effective vector element numbering and the location of even and odd elements. This means that the vector built-ins provided by altivec.h may not generate the instructions you would expect.

See also: Endian problems with halfword operations; General Endian Issues

So this header defines endian independent byte operations vec_vmrgeb() and vec_vmrgob(). These operations are used in the implementation of the endian sensitive vec_mrgeb() and vec_mrgob(). These support the OpenPOWER ABI mandated merge even/odd semantic.

We also provide the merge algebraic high/low operations vec_mrgahb() and vec_mrgalb() to simplify extended precision arithmetic. These implementations use vec_vmrgeb() and vec_vmrgob() as extended precision byte order does not change with endian. These operations are used in turn to implement multiply byte high/low/modulo (vec_mulhsb(), vec_mulhub(), vec_mulubm()).
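As a rough sketch of how multiply high can be composed from even/odd products and a merge of the high bytes, the following scalar model may help. The name `ref_mulhub` is hypothetical, the element numbering is assumed to be big-endian, and the code models the arithmetic only, not the vector instruction sequence.

```c
#include <stdint.h>

/* Scalar model of multiply high unsigned byte, big-endian numbering. */
static void ref_mulhub(const uint8_t a[16], const uint8_t b[16],
                       uint8_t res[16])
{
    uint16_t even[8], odd[8];

    for (int i = 0; i < 8; i++) {
        /* 16-bit products of the even and odd byte elements */
        even[i] = (uint16_t)(a[2 * i] * b[2 * i]);
        odd[i]  = (uint16_t)(a[2 * i + 1] * b[2 * i + 1]);
    }
    for (int i = 0; i < 8; i++) {
        /* merge algebraic high: keep the high byte of each halfword */
        res[2 * i]     = (uint8_t)(even[i] >> 8);
        res[2 * i + 1] = (uint8_t)(odd[i] >> 8);
    }
}
```

Keeping the low byte of each product instead (no shift) would model the multiply modulo variant.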

These operations provide a basis for using the multiplicative inverse as an alternative to integer divide.

See also: Examples, Divide by integer constant
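To illustrate the multiplicative inverse idea on a single byte, here is a small scalar sketch that divides by 10 using multiply high followed by a shift. The helper names are hypothetical, and the constant 205 (the ceiling of 2^11 / 10) is an assumption chosen for this example; it is exact for all 8-bit inputs but is not taken from the header.

```c
#include <stdint.h>

/* High order 8 bits of the 16-bit product of two unsigned bytes. */
static uint8_t ref_mulhi_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a * b) >> 8);
}

/* x / 10 computed as (x * 205) >> 11: multiply high, then shift right 3. */
static uint8_t ref_div10_u8(uint8_t x)
{
    return (uint8_t)(ref_mulhi_u8(x, 205) >> 3);
}
```

The same pattern applied to all 16 lanes is what a vector multiply high plus a shift immediate would compute.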

Performance data.

The performance characteristics of the merge and multiply byte operations are very similar to the halfword implementations (see Performance data).

More information.

High level performance estimates are provided as an aid to function selection when evaluating algorithms. For background on how Latency and Throughput are derived see: Performance data.

Function Documentation

◆ vec_absdub()

static vui8_t vec_absdub (vui8_t vra, vui8_t vrb)

inline static

Vector Absolute Difference Unsigned byte.

Compute the absolute difference for each byte. For each unsigned byte, subtract B[i] from A[i] and return the absolute value of the difference.

processor	Latency	Throughput
power8	4	1/cycle
power9	3	2/cycle

Parameters

vra	vector of 16 unsigned bytes
vrb	vector of 16 unsigned bytes

Returns: vector of the absolute difference.
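A per-element scalar model of the absolute difference, offered only as a sketch (`ref_absdub` is a hypothetical name, not part of the header):

```c
#include <stdint.h>

/* Absolute difference of two unsigned bytes; the result never wraps. */
static uint8_t ref_absdub(uint8_t a, uint8_t b)
{
    return (uint8_t)((a > b) ? (a - b) : (b - a));
}
```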

◆ vec_clzb()

static vui8_t vec_clzb ( vui8_t vra )

inline static

Vector Count Leading Zeros Byte for unsigned char (byte) elements.

Count the number of leading '0' bits (0-8) within each byte element of a 128-bit vector.

For POWER8 (PowerISA 2.07B) or later use the Vector Count Leading Zeros Byte instruction vclzb. Otherwise use a sequence of pre 2.07 VMX instructions. SIMDized count leading zeros inspired by:

Warren, Henry S. Jr., Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Figure 5-12.

processor	Latency	Throughput
power8	2	2/cycle
power9	3	2/cycle

Parameters

vra	128-bit vector treated as 16 x 8-bit unsigned integer (byte) elements.

Returns: 128-bit vector with the leading zeros count for each byte element.
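For reference, a plain scalar model of the per-byte result. This is a simple loop, not the Hacker's Delight sequence cited above, and `ref_clzb` is a hypothetical name. Note that a zero byte has 8 leading zeros.

```c
#include <stdint.h>

/* Count leading zero bits in one byte by scanning from the MSB down. */
static uint8_t ref_clzb(uint8_t x)
{
    uint8_t n = 0;

    for (uint8_t mask = 0x80; mask != 0 && (x & mask) == 0; mask >>= 1)
        n++;
    return n;
}
```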

◆ vec_ctzb()

static vui8_t vec_ctzb ( vui8_t vra )

inline static

Vector Count Trailing Zeros Byte for unsigned char (byte) elements.

Count the number of trailing '0' bits (0-8) within each byte element of a 128-bit vector.

For POWER9 (PowerISA 3.0B) or later use the Vector Count Trailing Zeros Byte instruction vctzb. Otherwise use a sequence of pre ISA 3.0 VMX instructions. SIMDized count trailing zeros inspired by:

Warren, Henry S. Jr., Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Section 5-4.

processor	Latency	Throughput
power8	6-8	2/cycle
power9	3	2/cycle

Parameters

vra	128-bit vector treated as 16 x 8-bit unsigned char (byte) elements.

Returns: 128-bit vector with the trailing zeros count for each byte element.
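One well-known way to count trailing zeros without a dedicated instruction is to isolate the bits below the lowest set bit and count them. The scalar sketch below assumes that approach; whether the header uses exactly this sequence is not stated here, and both function names are hypothetical.

```c
#include <stdint.h>

/* Count the '1' bits in one byte with a simple loop. */
static uint8_t ref_popcnt8(uint8_t x)
{
    uint8_t n = 0;

    for (; x != 0; x >>= 1)
        n += x & 1;
    return n;
}

/* Trailing zeros: ~x & (x - 1) has a '1' for each trailing '0' of x. */
static uint8_t ref_ctzb(uint8_t x)
{
    uint8_t below = (uint8_t)(~x & (x - 1));

    return ref_popcnt8(below);
}
```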

◆ vec_isalnum()

static vui8_t vec_isalnum ( vui8_t vec_str )

inline static

Vector isalnum.

Return a vector boolean char with a true indicator for any character that is either Lower Case Alpha ASCII, Upper Case Alpha ASCII, or numeric ASCII. False otherwise.

processor	Latency	Throughput
power8	10-20	1/cycle
power9	11-21	1/cycle

Parameters

vec_str vector of 16 ASCII characters

Returns: vector bool char of the isalnum operation applied to each character of vec_str. For each byte 0xff indicates true (isalnum), 0x00 indicates false.

◆ vec_isalpha()

static vui8_t vec_isalpha ( vui8_t vec_str )

inline static

Vector isalpha.

Return a vector boolean char with a true indicator for any character that is either Lower Case Alpha ASCII or Upper Case Alpha ASCII. False otherwise.

processor	Latency	Throughput
power8	9-18	1/cycle
power9	10-19	1/cycle

Parameters

vec_str vector of 16 ASCII characters

Returns: vector bool char of the isalpha operation applied to each character of vec_str. For each byte 0xff indicates true (isalpha), 0x00 indicates false.

◆ vec_isdigit()

static vui8_t vec_isdigit ( vui8_t vec_str )

inline static

Vector isdigit.

Return a vector boolean char with a true indicator for any character that is ASCII decimal digit. False otherwise.

processor	Latency	Throughput
power8	4-13	1/cycle
power9	5-14	1/cycle

Parameters

vec_str vector of 16 ASCII characters

Returns: vector bool char of the isdigit operation applied to each character of vec_str. For each byte 0xff indicates true (isdigit), 0x00 indicates false.
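The character tests all follow the same pattern: a range compare per byte that yields an all-ones or all-zeros mask. A scalar sketch for the digit test, with `ref_isdigit` as a hypothetical name:

```c
#include <stdint.h>

/* For each of 16 bytes, 0xff if it is an ASCII decimal digit, else 0x00. */
static void ref_isdigit(const uint8_t s[16], uint8_t res[16])
{
    for (int i = 0; i < 16; i++)
        res[i] = (s[i] >= '0' && s[i] <= '9') ? 0xff : 0x00;
}
```

The resulting mask can then be used with a select or AND operation to act only on the matching characters.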

◆ vec_mrgahb()

static vui8_t vec_mrgahb (vui16_t vra, vui16_t vrb)

inline static

Vector Merge Algebraic High Byte operation.

Merge only the high bytes from 16 x Algebraic halfwords across vectors vra and vrb. This is effectively the Vector Merge Even Byte operation that is not modified for Endian.

For example merge the high 8-bits from each of 16 x 16-bit products as generated by vec_muleub/vec_muloub. This result is effectively a vector multiply high unsigned byte.

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned short.
vrb	128-bit vector unsigned short.

Returns: A vector merge from only the high bytes of the 16 x Algebraic halfwords across vra and vrb.

◆ vec_mrgalb()

static vui8_t vec_mrgalb (vui16_t vra, vui16_t vrb)

inline static

Vector Merge Algebraic Low Byte operation.

Merge only the low bytes from 16 x Algebraic halfwords across vectors vra and vrb. This is effectively the Vector Merge Odd Bytes operation that is not modified for Endian.

For example merge the low 8-bits from each of 16 x 16-bit products as generated by vec_muleub/vec_muloub. This result is effectively a vector multiply low unsigned byte.

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned short.
vrb	128-bit vector unsigned short.

Returns: A vector merge from only the low bytes of the 16 x Algebraic halfwords across vra and vrb.

◆ vec_mrgeb()

static vui8_t vec_mrgeb (vui8_t vra, vui8_t vrb)

inline static

Vector Merge Even Bytes operation.

Merge the even byte elements from the concatenation of 2 x vectors (vra and vrb).

Note: The element numbering changes between Big and Little Endian. So the compiler and this implementation adjust the generated code to reflect this.

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned char.
vrb	128-bit vector unsigned char.

Returns: A vector merge from only the even bytes of vra and vrb.

◆ vec_mrgob()

static vui8_t vec_mrgob (vui8_t vra, vui8_t vrb)

inline static

Vector Merge Odd Bytes operation.

Merge the odd byte elements from the concatenation of 2 x vectors (vra and vrb).

Note: The element numbering changes between Big and Little Endian. So the compiler and this implementation adjust the generated code to reflect this.

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned char.
vrb	128-bit vector unsigned char.

Returns: A vector merge from only the odd bytes of vra and vrb.

◆ vec_mulhsb()

static vi8_t vec_mulhsb (vi8_t vra, vi8_t vrb)

inline static

Vector Multiply High Signed Bytes.

Multiply the corresponding byte elements of two vector signed char values and return the high order 8-bits, for each 16-bit product element.

processor	Latency	Throughput
power8	9-13	1/cycle
power9	10-14	1/cycle

Parameters

vra	128-bit vector signed char.
vrb	128-bit vector signed char.

Returns: vector of the high order 8-bits of the product of the byte elements from vra and vrb.

◆ vec_mulhub()

static vui8_t vec_mulhub (vui8_t vra, vui8_t vrb)

inline static

Vector Multiply High Unsigned Bytes.

Multiply the corresponding byte elements of two vector unsigned char values and return the high order 8-bits, for each 16-bit product element.

processor	Latency	Throughput
power8	9-13	1/cycle
power9	10-14	1/cycle

Parameters

vra	128-bit vector unsigned char.
vrb	128-bit vector unsigned char.

Returns: vector of the high order 8-bits of the product of the byte elements from vra and vrb.

◆ vec_mulubm()

static vui8_t vec_mulubm (vui8_t vra, vui8_t vrb)

inline static

Vector Multiply Unsigned Byte Modulo.

Multiply the corresponding byte elements of two vector unsigned char values and return the low order 8-bits of the 16-bit product for each element.

Note: vec_mulubm can be used for unsigned or signed char integers. It is the vector equivalent of Multiply Low Byte.

processor	Latency	Throughput
power8	9-13	1/cycle
power9	10-14	1/cycle

Parameters

vra	128-bit vector unsigned char.
vrb	128-bit vector unsigned char.

Returns: vector of the low order 8-bits of the unsigned product of the byte elements from vra and vrb.
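The note that the modulo product works for both signed and unsigned bytes follows from modular arithmetic: the low 8 bits of the product do not depend on how the operands' sign bits are interpreted. A scalar sketch, with `ref_mulubm` as a hypothetical name:

```c
#include <stdint.h>

/* Low order 8 bits of the product of two bytes. */
static uint8_t ref_mulubm(uint8_t a, uint8_t b)
{
    return (uint8_t)((unsigned)a * b);
}
```

Reinterpreting the same bit patterns as signed char, multiplying, and truncating to 8 bits gives the identical bit pattern.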

◆ vec_popcntb()

static vui8_t vec_popcntb ( vui8_t vra )

inline static

Vector Population Count byte.

Count the number of '1' bits (0-8) within each byte element of a 128-bit vector.

For POWER8 (PowerISA 2.07B) or later use the Vector Population Count Byte instruction. Otherwise use simple Vector (VMX) instructions to count bits in bytes in parallel. SIMDized population count inspired by:

Warren, Henry S. Jr., Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Figure 5-2.

processor	Latency	Throughput
power8	2	2/cycle
power9	3	2/cycle

Parameters

vra	128-bit vector treated as 16 x 8-bit integer (byte) elements.

Returns: 128-bit vector with the population count for each byte element.
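Counting bits "in parallel" usually means summing adjacent bit fields of increasing width. The scalar sketch below shows that general technique narrowed to one byte; it is an assumption that the pre-POWER8 path resembles it, and `ref_popcntb` is a hypothetical name.

```c
#include <stdint.h>

/* Population count of one byte by summing 1-bit, 2-bit, then 4-bit fields. */
static uint8_t ref_popcntb(uint8_t x)
{
    x = (uint8_t)(x - ((x >> 1) & 0x55));          /* four 2-bit sums */
    x = (uint8_t)((x & 0x33) + ((x >> 2) & 0x33)); /* two 4-bit sums  */
    x = (uint8_t)((x + (x >> 4)) & 0x0f);          /* one 8-bit sum   */
    return x;
}
```

Because each step uses only shifts, ANDs, adds and subtracts, the same sequence applies to all 16 byte lanes of a vector at once.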

◆ vec_setb_sb()

static vb8_t vec_setb_sb ( vi8_t vra )

inline static

Vector Set Bool from Signed Byte.

For each byte, propagate the sign bit to all 8-bits of that byte. The result is vector bool char reflecting the sign bit of each 8-bit byte.

processor	Latency	Throughput
power8	2-4	2/cycle
power9	2-5	2/cycle

Parameters

vra	Vector signed char.

Returns: vector bool char reflecting the sign bit of each byte.
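A per-element scalar model of the sign propagation, as a sketch only (`ref_setb_sb` is a hypothetical name):

```c
#include <stdint.h>

/* 0xff if the sign bit of the byte is set, 0x00 otherwise. */
static uint8_t ref_setb_sb(int8_t x)
{
    return (x < 0) ? 0xff : 0x00;
}
```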

◆ vec_shift_leftdo()

static vui8_t vec_shift_leftdo (vui8_t vrw, vui8_t vrx, vui8_t vrb)

inline static

Shift left double quadword by octet. Return a vector unsigned char that is the leftmost 16 chars after shifting left 0-15 octets (chars) of the 32 char double vector (vrw||vrx). The octet shift amount is from bits 121:124 of vrb.

This sequence can be used to align an unaligned 16 char substring based on the result of a vector count leading zeros of the compare boolean.

processor	Latency	Throughput
power8	6-8	1/cycle
power9	8-9	1/cycle

Parameters

vrw	upper 16-bytes of the 32-byte double vector.
vrx	lower 16-bytes of the 32-byte double vector.
vrb	Shift amount in bits 121:124.

Returns: upper 16-bytes of left shifted double vector.
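The double-vector shift can be modeled in scalar code as a 16-byte window sliding over a 32-byte buffer. The sketch below assumes big-endian byte and bit numbering, so bits 121:124 are taken from the rightmost byte (index 15) as `(b[15] >> 3) & 0xf`; `ref_shift_leftdo` is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* Leftmost 16 bytes of (w || x) after shifting left by 0-15 octets. */
static void ref_shift_leftdo(const uint8_t w[16], const uint8_t x[16],
                             const uint8_t b[16], uint8_t res[16])
{
    uint8_t cat[32];
    unsigned sh = (b[15] >> 3) & 0x0f;  /* bits 121:124 of the 128-bit b */

    memcpy(cat, w, 16);        /* upper 16 bytes */
    memcpy(cat + 16, x, 16);   /* lower 16 bytes */
    memcpy(res, cat + sh, 16); /* window starting sh octets in */
}
```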

◆ vec_slbi()

static vui8_t vec_slbi (vui8_t vra, const unsigned int shb)

inline static

Vector Shift Left Byte Immediate.

Shift left each byte element [0-15], 0-7 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-7. A shift count of 0 returns the original value of vra. Shift counts greater than 7 bits return zero.

processor	Latency	Throughput
power8	4-11	2/cycle
power9	5-11	2/cycle

Parameters

vra	a 128-bit vector treated as a vector unsigned char.
shb	Shift amount in the range 0-7.

Returns: 128-bit vector unsigned char, shifted left shb bits.

◆ vec_srabi()

static vi8_t vec_srabi (vi8_t vra, const unsigned int shb)

inline static

Vector Shift Right Algebraic Byte Immediate.

Shift right each byte element [0-15], 0-7 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-7. A shift count of 0 returns the original value of vra. Shift counts greater than 7 bits return the sign bit propagated to each bit of each element.

processor	Latency	Throughput
power8	4-11	2/cycle
power9	5-11	2/cycle

Parameters

vra	a 128-bit vector treated as a vector signed char.
shb	Shift amount in the range 0-7.

Returns: 128-bit vector signed char, shifted right shb bits.
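The behavior described above, including the out-of-range case, can be sketched per element as follows. `ref_srabi` is a hypothetical name; the shift of negative values is spelled out with complements so the model does not rely on implementation-defined behavior of `>>` on negative integers.

```c
#include <stdint.h>

/* Algebraic (sign-propagating) shift right of one signed byte. */
static int8_t ref_srabi(int8_t x, unsigned shb)
{
    if (shb > 7)
        return (int8_t)((x < 0) ? -1 : 0); /* only the sign remains */
    if (x < 0)
        return (int8_t)~((~x) >> shb);     /* shift the complement, flip back */
    return (int8_t)(x >> shb);
}
```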

◆ vec_srbi()

static vui8_t vec_srbi (vui8_t vra, const unsigned int shb)

inline static

Vector Shift Right Byte Immediate.

Shift right each byte element [0-15], 0-7 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-7. A shift count of 0 returns the original value of vra. Shift counts greater than 7 bits return zero.

processor	Latency	Throughput
power8	4-11	2/cycle
power9	5-11	2/cycle

Parameters

vra	a 128-bit vector treated as a vector unsigned char.
shb	Shift amount in the range 0-7.

Returns: 128-bit vector unsigned char, shifted right shb bits.

◆ vec_tolower()

static vui8_t vec_tolower ( vui8_t vec_str )

inline static

Vector tolower.

Convert any Upper Case Alpha ASCII characters within a vector unsigned char into the equivalent Lower Case character. Return the result as a vector unsigned char.

processor	Latency	Throughput
power8	8-17	1/cycle
power9	9-18	1/cycle

Parameters

vec_str vector of 16 ASCII characters

Returns: vector char converted to lower case.
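A scalar sketch of the case conversion: characters in the range 'A' to 'Z' get bit 0x20 set and everything else passes through unchanged. `ref_tolower` is a hypothetical name, and the range-compare-then-OR formulation is an assumption about how such a conversion is typically vectorized.

```c
#include <stdint.h>

/* Convert upper case ASCII letters in 16 bytes to lower case. */
static void ref_tolower(const uint8_t s[16], uint8_t res[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t c = s[i];
        res[i] = (c >= 'A' && c <= 'Z') ? (uint8_t)(c | 0x20) : c;
    }
}
```

The upper case conversion is the mirror image: test for 'a' to 'z' and clear bit 0x20.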

◆ vec_toupper()

static vui8_t vec_toupper ( vui8_t vec_str )

inline static

Vector toupper.

Convert any Lower Case Alpha ASCII characters within a vector unsigned char into the equivalent Upper Case character. Return the result as a vector unsigned char.

processor	Latency	Throughput
power8	8-17	1/cycle
power9	9-18	1/cycle

Parameters

vec_str vector of 16 ASCII characters

Returns: vector char converted to upper case.

◆ vec_vmrgeb()

static vui8_t vec_vmrgeb (vui8_t vra, vui8_t vrb)

inline static

Vector Merge Even Bytes.

Merge the even byte elements from the concatenation of 2 x vectors (vra and vrb).

Note

This function implements the operation of a Vector Merge Even Bytes instruction, if the PowerISA included such an instruction. This implementation is NOT Endian sensitive and the function is stable across BE/LE implementations. Using Big Endian element numbering:

res[0] = vra[0];
res[1] = vrb[0];
res[2] = vra[2];
res[3] = vrb[2];
res[4] = vra[4];
res[5] = vrb[4];
res[6] = vra[6];
res[7] = vrb[6];
res[8] = vra[8];
res[9] = vrb[8];
res[10] = vra[10];
res[11] = vrb[10];
res[12] = vra[12];
res[13] = vrb[12];
res[14] = vra[14];
res[15] = vrb[14];

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned char.
vrb	128-bit vector unsigned char.

Returns: A vector merge from only the even bytes of vra and vrb.

◆ vec_vmrgob()

static vui8_t vec_vmrgob (vui8_t vra, vui8_t vrb)

inline static

Vector Merge Odd Bytes.

Merge the odd byte elements from the concatenation of 2 x vectors (vra and vrb).

Note

This function implements the operation of a Vector Merge Odd Bytes instruction, if the PowerISA included such an instruction. This implementation is NOT Endian sensitive and the function is stable across BE/LE implementations. Using Big Endian element numbering:

res[0] = vra[1];
res[1] = vrb[1];
res[2] = vra[3];
res[3] = vrb[3];
res[4] = vra[5];
res[5] = vrb[5];
res[6] = vra[7];
res[7] = vrb[7];
res[8] = vra[9];
res[9] = vrb[9];
res[10] = vra[11];
res[11] = vrb[11];
res[12] = vra[13];
res[13] = vrb[13];
res[14] = vra[15];
res[15] = vrb[15];

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned char.
vrb	128-bit vector unsigned char.

Returns: A vector merge from only the odd bytes of vra and vrb.
