Header package containing a collection of 128-bit SIMD operations over 16-bit integer elements. More...

#include <pveclib/vec_char_ppc.h>

Functions
static vui16_t	vec_absduh (vui16_t vra, vui16_t vrb)
	Vector Absolute Difference Unsigned halfword. More...

static vui16_t	vec_clzh (vui16_t vra)
	Vector Count Leading Zeros Halfword for unsigned short elements. More...

static vui16_t	vec_ctzh (vui16_t vra)
	Vector Count Trailing Zeros Halfword for unsigned short elements. More...

static vui16_t	vec_mrgahh (vui32_t vra, vui32_t vrb)
	Vector Merge Algebraic High Halfword operation. More...

static vui16_t	vec_mrgalh (vui32_t vra, vui32_t vrb)
	Vector Merge Algebraic Low Halfword operation. More...

static vui16_t	vec_mrgeh (vui16_t vra, vui16_t vrb)
	Vector Merge Even Halfwords operation. More...

static vui16_t	vec_mrgoh (vui16_t vra, vui16_t vrb)
	Vector Merge Odd Halfwords operation. More...

static vi16_t	vec_mulhsh (vi16_t vra, vi16_t vrb)
	Vector Multiply High Signed halfword. More...

static vui16_t	vec_mulhuh (vui16_t vra, vui16_t vrb)
	Vector Multiply High Unsigned halfword. More...

static vui16_t	vec_muluhm (vui16_t vra, vui16_t vrb)
	Vector Multiply Unsigned halfword Modulo. More...

static vui16_t	vec_popcnth (vui16_t vra)
	Vector Population Count halfword. More...

static vui16_t	vec_revbh (vui16_t vra)
	byte reverse each halfword of a vector unsigned short. More...

static vb16_t	vec_setb_sh (vi16_t vra)
	Vector Set Bool from Signed Halfword. More...

static vui16_t	vec_slhi (vui16_t vra, const unsigned int shb)
	Vector Shift left Halfword Immediate. More...

static vui16_t	vec_srhi (vui16_t vra, const unsigned int shb)
	Vector Shift Right Halfword Immediate. More...

static vi16_t	vec_srahi (vi16_t vra, const unsigned int shb)
	Vector Shift Right Algebraic Halfword Immediate. More...

static vui32_t	vec_vmaddeuh (vui16_t a, vui16_t b, vui16_t c)
	Vector Multiply-Add Even Unsigned Halfwords. More...

static vui32_t	vec_vmaddouh (vui16_t a, vui16_t b, vui16_t c)
	Vector Multiply-Add Odd Unsigned Halfwords. More...

static vui16_t	vec_vmrgeh (vui16_t vra, vui16_t vrb)
	Vector Merge Even Halfwords. More...

static vui16_t	vec_vmrgoh (vui16_t vra, vui16_t vrb)
	Vector Merge Odd Halfwords. More...

Detailed Description

Header package containing a collection of 128-bit SIMD operations over 16-bit integer elements.

Most of these operations are implemented in a single instruction on newer (POWER6/POWER7POWER8/POWER9) processors. This header serves to fill in functional gaps for older (POWER7, POWER8) processors and provides a in-line assembler implementation for older compilers that do not provide the build-ins.

Most vector short (16-bit integer halfword) operations are implemented with PowerISA VMX instructions either defined by the original VMX (AKA Altivec) or added to later versions of the PowerISA. PowerISA 2.07B (POWER8) added several useful halfword operations (count leading zeros, population count) not included in the original VMX. PowerISA 3.0B (POWER9) adds several more (absolute difference, compare not equal, count trailing zeros, extend sign, extract/insert, and reverse bytes). Most of these intrinsic (compiler built-ins) operations are defined in <altivec.h> and described in the compiler documentation.

Note: The compiler disables associated <altivec.h> built-ins if the mcpu target does not enable the specific instruction. For example if you compile with -mcpu=power7, vec_vclz and vec_vclzh will not be defined. Another example if you compile with -mcpu=power8, vec_revb will not be defined. But vec_vclzh and vec_revbh is always defined in this header. This header provides the appropriate substitutions, will generate the minimum code, appropriate for the target, and produce correct results.; Most ppc64le compilers will default to -mcpu=power8 if not specified.

This header covers operations that are either:

Implemented in hardware instructions for later processors and useful to programmers, on slightly older processors, even if the equivalent function requires more instructions. Examples include Count Leading Zeros, Population Count and Byte Reverse.
Defined in the OpenPOWER ABI but not yet defined in <altivec.h> provided by available compilers in common use. Examples include Count Leading Zeros, Population Count and Byte Reverse.
Commonly used operations, not covered by the ABI or <altivec.h>, and require multiple instructions or are not obvious. Examples include the multiply-add, multiply-high and shift immediate operations.

Recent Additions

Added vec_vmaddeuh() and vec_vmaddouh() as an optimization for the vector multiply quadword implementations on POWER7.

Endian problems with halfword operations

It would be useful to provide a vector multiply high halfword (return the high order 16-bits of the 32-bit product) operation. This can be used for multiplicative inverse (effectively integer divide) operations. Neither integer multiply high nor divide are available as vector instructions. However the multiply high halfword operation can be composed from the existing multiply even/odd halfword operations followed by the vector merge even halfword operation. Similarly a multiply low (modulo) halfword operation can be composed from the existing multiply even/odd halfword operations followed by the vector merge odd halfword operation.

As a prerequisite we need to provide the merge even/odd halfword operations. While PowerISA has added these operations for word and doubleword, instructions are nor defined for byte and halfword. Fortunately vector merge operations are just a special case of vector permute. So the vec_vmrgoh() and vec_vmrgeh() implementation can use vec_perm and appropriate selection vectors to provide these merge operations.

But this is complicated by little-endian (LE) support as specified in the OpenPOWER ABI and as implemented in the compilers. Little-endian changes the effective vector element numbering and the location of even and odd elements. This means that the vector built-ins provided by altivec.h may not generate the instructions you would expect.

See also: General Endian Issues; Endian problems with word operations

The OpenPOWER ABI provides a helpful table of Endian Sensitive Operations. For for vec_mule (vmuleuh, vmulesh):

Replace with vmulouh and so on, for LE.

For for vec_mulo (vmulouh, vmulosh):

Replace with vmuleuh and so on, for LE.

Also for vec_perm (vperm) it specifies:

For LE, Swap input arguments and complement the selection vector.

The above is just a sampling of a larger list of Endian Sensitive Operations.

The obvious coding for Vector Multiply High Halfword would be:

vui16_t
test_mulhw (vui16_t vra, vui16_t vrb)
{
  return vec_mergee ((vui16_t)vec_mule (vra, vrb),
                     (vui16_t)vec_mulo (vra, vrb));
}

A couple problems with this:

vec_mergee is only defined for vector int/long and float/double (word/doubleword) types.
vec_mergee is endian sensitive and would produce the wrong results in LE mode.
vec_mule/vec_mulo are endian sensitive and produce the wrong results in LE mode.

The first step is to implement Vector Merge Even Halfword operation:

static inline vui16_t
vec_vmrgeh (vui16_t vra, vui16_t vrb)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  vui16_t permute =
      { 0x0302,0x1312, 0x0706,0x1716, 0x0B0A,0x1B1A, 0x0F0E,0x1F1E };
 
  return vec_perm (vrb, vra, (vui8_t)permute);
#else
  vui16_t permute =
      { 0x0001,0x1011, 0x0405,0x1415, 0x0809,0x1819, 0x0C0D,0x1C1D};
 
  return vec_perm (vra, vrb, (vui8_t)permute);
#endif
}

For big-endian we have a straight forward vec_perm with a permute select vector interleaving even halfwords from vectors vra and vrb.

For little-endian we need to nullify the LE transform applied by the compiler. So the select vector looks like it interleaves odd halfwords from vectors vrb and vra. It also reverses byte numbering within halfwords. The compiler transforms this back into the operation we wanted in the first place. The result is not endian sensitive and is stable across BE/LE implementations. Similarly for the Vector Merge Odd Halfword operation.

As good OpenPOWER ABI citizens we should also provide endian sensitive operations vec_mrgeh() vec_mrgoh(). For example:

static inline vui16_t
vec_mrgeh  (vui16_t vra, vui16_t vrb)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  return vec_vmrgoh ((vui16_t) vrb, (vui16_t) vra);
#else
  return vec_vmrgeh ((vui16_t) vra, (vui16_t) vrb);
#endif
}

Note: This is essentially what the compiler would do for vec_mergee.

Also to follow that pattern established for vec_int32_ppc.h we should provide implementations for Vector Merge Algebraic High/Low Halfword. For example:

static inline vui16_t
vec_mrgahh  (vui32_t vra, vui32_t vrb)
{
  return vec_vmrgeh ((vui16_t) vra, (vui16_t) vrb);
}

This is simpler as we can use the endian invariant vec_vmrgeh() operation. Similarly for Vector Merge Algebraic Low Halfword using vec_vmrgoh().

Note: The inputs are defined as 32-bit to match the results from multiply even/odd halfword.

Now we have all the parts we need to implement multiply high/low halfword. For example Multiply High Unsigned Halfword:

static inline vui16_t
vec_mulhuh (vui16_t vra, vui16_t vrb)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  return vec_mrgahh (vec_mulo (vra, vrb), vec_mule (vra, vrb));
#else
  return vec_mrgahh (vec_mule (vra, vrb), vec_mulo (vra, vrb));
#endif
}

Similarly for Multiply High Signed Halfword.

Note: For LE we need to nullify the compiler transform by reversing of the order of vec_mulo/vec_mule. This is required to get the algebraically correct (multiply high) result.

Finally we can implement the Multiply Low Halfword which by PowerISA conventions is called Multiply Unsigned Halfword Modulo:

static inline vui16_t
vec_muluhm (vui16_t vra, vui16_t vrb)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  return vec_mrgalh (vec_mulo (vra, vrb), vec_mule (vra, vrb));
#else
  return vec_mrgalh (vec_mule (vra, vrb), vec_mulo (vra, vrb));
#endif
}

Note: We use the endian stable vec_mrgalh() for multiply low. Again for LE we have to nullify the compiler transform by reversing of the order of vec_mulo/vec_mule. This is required to get the algebraically correct (multiply high) result.; vec_muluhm() works for signed and unsigned multiply low (modulo).

Multiply High Unsigned Halfword Example

So what does the compiler generate after unwinding three levels of inline functions. For this test case:

vui16_t
__test_mulhuh (vui16_t a, vui16_t b)
{
  return vec_mulhuh (a, b);
}

The GCC 8 compiler targeting powerpc64le and -mcpu=power8 generates:

addis   r9,r2,.rodata.cst16@ha
vmulouh v1,v2,v3
vmuleuh v2,v2,v3
addi    r9,r9,.rodata.cst16@l
lvx     v0,0,r9
xxlnor  vs32,vs32,vs32
vperm   v2,v2,v1,v0

The addis, addi, lvx instruction sequence loads the permute selection constant vector. The xxlnor instruction complements the selection vector for LE. These instructions are only needed once per function and can be hoisted out of loops and shared across instances of vec_mulhuh(). Which might look like this:

      addis   r9,r2,.rodata.cst16@ha
      addi    r9,r9,.rodata.cst16@l
      lvx     v0,0,r9
      xxlnor  vs32,vs32,vs32
      ...
Loop:
      vmulouh v1,v2,v3
      vmuleuh v2,v2,v3
      vperm   v2,v2,v1,v0
      ...

The vmulouh, vmuleuh, vperm instruction sequence is the core of the function. They multiply the elements and selects/merges the high order 16-bits of each product into the result vector.

Examples, Divide by integer constant

Suppose we have a requirement to convert an array of 16-bit unsigned short values to decimal. The classic itoa implementation performs a sequence of divide / modulo by 10 operations that produce one (decimal) value per iteration, until the divide returns 0.

For this example we want to vectorize the operation and the PowerISA (and most other platforms) does not provide a vector integer divide instruction. But we do have vector integer multiply. As we will see the multiply high defined above is very handy for applying the multiplicative inverse. Also, the conversion divide is a constant value applied across the vector which simplifies the coding.

Here we can use the multiplicative inverse which is a scaled fixed point fraction calculated from the original divisor. This works nicely if the fixed radix point is just before the 16-bit fraction and we have a multiply high (vec_mulhuh()) operation. Multiplying a 16-bit unsigned integer by a 16-bit unsigned fraction generates a 32-bit product with 16-bits above (integer) and below (fraction) the radix point. The high 16-bits of the product is a good approximation of the integer quotient.

It turns out that generating the multiplicative inverse can be tricky. To produce correct results over the full range, requires possible pre-scaling and post-shifting, and sometimes a corrective addition. Fortunately, the mathematics are well understood and are commonly used in optimizing compilers. Even better, Henry Warren's book has a whole chapter on this topic.

See also: "Hacker's Delight, 2nd Edition," Henry S. Warren, Jr, Addison Wesley, 2013. Chapter 10, Integer Division by Constants.

Divide by constant 10 examples

In the chapter above;

Figure 10-2 Computing the magic number for unsigned division.

provides a sample C function for generating the magic number (actually a struct containing; the magic multiplicative inverse, "add" indicator, and the shift amount). For the 16-bit unsigned divisor 10, this is { 52429, 0, 3 }:

the multiplier is 52429.
no corrective add of the dividend is required.
the final shift is 3-bits right.

Which could look like this:

 vui16_t
__test_div10 (vui16_t n)
{
  vui16_t q;
  // M= 52429, a=0, s=3
  vui16_t magic = vec_splats ((unsigned short) 52429);
  const int s = 3;
 
  q = vec_mulhuh (magic, n);
  return vec_srhi (q, s);
}

But we also need the modulo to extract each digit. The simplest and oldest technique is to multiply the quotient by the divisor (constant 10) and subtract that from the original dividend. Here we can use the vec_muluhm() operation we defined above. Which could look like this:

 vui16_t
__test_mod10 (vui16_t n)
{
  vui16_t q;
  // M= 52429, a=0, s=3
  vui16_t magic = vec_splats ((unsigned short) 52429);
  vui16_t c_10 = vec_splats ((unsigned short) 10);
  const int s = 3;
  vui16_t tmp, rem, q_10;
 
  q = vec_mulhuh (magic, n);
  q_10 = vec_srhi (q, s);
  tmp = vec_muluhm (q_10, c_10);
  rem = vec_sub (n, tmp);
  return rem;
}

Note: vec_sub() and vec_splats() are an existing altivec.h generic built-ins.

Divide by constant 10000 example

As we mentioned above, some divisors require an add before the shift as a correction. For the 16-bit unsigned divisor 10000 this is { 41839, 1, 14 }:

the multiplier is 41839.
corrective add of the dividend is required.
the final shift is 14-bits right.

In this case the perfect multiplier is too large (>= 2**16). So the magic multiplier is reduced by 2**16 and to correct for this we need to add the dividend to the product. This add may generate a carry that must be included in the shift. Here vec_avg handles the 17-bit sum internally before shifting right 1. But vec_avg adds an extra +1 (for rounding) that we don't want. So we use (n-1) for the product correction then complete the operation with shift right (s-1). Which could look like this:

vui16_t
__test_div10000 (vui16_t n)
{
  vui16_t result, q;
  // M= 41839, a=1, s=14
  vui16_t magic = vec_splats ((unsigned short) 41839);
  const int s = 14;
  vui16_t tmp, rem;
 
  q = vec_mulhuh (magic, n);
    {
      const vui16_t vec_ones = vec_splat_u16 ( 1 );
      vui16_t n_1 = vec_sub (n, vec_ones);
      // avg = (q + (n-1) + 1) >> 1
      q = vec_avg (q, n_1);
      result = vec_srhi (q, (s - 1));
    }
  return result;
}

Note: vec_avg(), vec_sub(), vec_splats() and vec_splat_u16() are existing altivec.h generic built-ins.

The modulo computation remains the same as Divide by constant 10 examples.

Performance data.

We can use the example above (see Multiply High Unsigned Halfword Example) to illustrate the performance metrics pveclib provides. For vec_mulhuh() the core operation is the sequence vmulouh/vmuleuh/vperm. This represents the best case latency, when it is used multiple times in a single larger function.

The compiler notes that vmulouh/vmuleuh are independent instructions that can execute concurrently (in separate vector pipelines). The compiler schedules them to issue in same cycle. The latency for vmulouh/vmuleuh is listed as 7 cycle and the throughput of 2 per cycle (there are 2 vector pipes for multiply). As we assume this function will use both vector pipelines, the throughput for this function is reduced to 1 per cycle.

We still need to select/merge the results. The vperm instruction is dependent on the execution of both vmulouh/vmuleuh and load of the select vector complete. For this case we assume that the load of the permute select vector has already executed. The processor can not issue the vperm until both vmulouh/vmuleuh instructions execute. The latency for vperm is 2 cycles (3 on POWER9). So the best case latency for this operation is is (7 + 2 = 9) cycles (10 on POWER9).

Looking at the first or only execution of vec_mulhuh() in a function defines the worse case latency. Here we have to include the permute select vector load and (for LE) the select vector complement. However this case provides additional multiple pipe parallelism that needs to be accounted for in the latencies.

The compiler notes that addis/vmulouh/vmuleuh are independent instructions that can execute concurrently in separate pipelines. So the compiler schedules them to issue in same cycle. The latency for vmulouh/vmuleuh is 7 cycles while the addis latency is only 2 cycles. The dependent addi instruction can issue in the 3rd cycle, while vmulouh/vmuleuh are still executing. The addi also has a 2 cycle latency, so the dependent lvx can issue in the 5th cycle, while vmulouh/vmuleuh are still executing. The lvx has a latency of 5 cycles and will not complete execution until 2 cycles after vmulouh/vmuleuh. The dependent xxlnor is waiting of the load (lvx) and has a latency of 2 cycles.

So there are two independent instruction sequences; vmulouh/vmuleuh and addis/addi/lvx/xxlnor. Both must complete execution before the vperm can issue and complete the operation. The later sequence has the longer (2+2+5+2=11) latency and dominates the timing. So the worst latency for the full sequence is (2+2+5+2+2 = 13) cycles (14 on POWER9).

processor	Latency	Throughput
power8	9-13	1/cycle
power9	10-14	1/cycle

More information.

High level performance estimates are provided as an aid to function selection when evaluating algorithms. For background on how Latency and Throughput are derived see: Performance data.

Function Documentation

◆ vec_absduh()

static vui16_t vec_absduh	(	vui16_t	vra,
		vui16_t	vrb
	)

inlinestatic

Vector Absolute Difference Unsigned halfword.

Compute the absolute difference for each halfword. For each unsigned halfword, subtract VRB[i] from VRA[i] and return the absolute value of the difference.

processor	Latency	Throughput
power8	4	1/cycle
power9	3	2/cycle

Parameters

vra	vector of 8 x unsigned halfword
vrb	vector of 8 x unsigned halfword

Returns: vector of the absolute differences.

◆ vec_clzh()

static vui16_t vec_clzh ( vui16_t vra )

inlinestatic

Vector Count Leading Zeros Halfword for unsigned short elements.

Count the number of leading '0' bits (0-16) within each halfword element of a 128-bit vector.

For POWER8 (PowerISA 2.07B) or later use the Vector Count Leading Zeros Halfword instruction vclzh. Otherwise use sequence of pre 2.07 VMX instructions.

Note: SIMDized count leading zeros inspired by: Warren, Henry S. Jr and Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Figure 5-12.

processor	Latency	Throughput
power8	2	2/cycle
power9	3	2/cycle

Parameters

vra	128-bit vector treated as 8 x 16-bit unsigned integer (halfword) elements.

Returns: 128-bit vector with the leading zeros count for each halfword element.

◆ vec_ctzh()

static vui16_t vec_ctzh ( vui16_t vra )

inlinestatic

Vector Count Trailing Zeros Halfword for unsigned short elements.

Count the number of trailing '0' bits (0-16) within each halfword element of a 128-bit vector.

For POWER9 (PowerISA 3.0B) or later use the Vector Count Trailing Zeros Halfword instruction vctzh. Otherwise use a sequence of pre ISA 3.0 VMX instructions. SIMDized count trailing zeros inspired by:

Warren, Henry S. Jr and Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Section 5-4.

processor	Latency	Throughput
power8	6-8	2/cycle
power9	3	2/cycle

Parameters

vra	128-bit vector treated as 8 x 16-bit unsigned short integer (halfwords) elements.

Returns: 128-bit vector with the trailing zeros count for each halfword element.

◆ vec_mrgahh()

static vui16_t vec_mrgahh	(	vui32_t	vra,
		vui32_t	vrb
	)

inlinestatic

Vector Merge Algebraic High Halfword operation.

Merge only the high halfwords from 8 x Algebraic words across vectors vra and vrb. This is effectively the Vector Merge Even Halfword operation that is not modified for endian.

For example merge the high 16-bits from each of 8 x 32-bit products as generated by vec_muleuh/vec_mulouh. This result is effectively a vector multiply high unsigned halfword.

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned int.
vrb	128-bit vector unsigned int.

Returns: A vector merge from only the high halfwords of the 8 x Algebraic words across vra and vrb.

◆ vec_mrgalh()

static vui16_t vec_mrgalh	(	vui32_t	vra,
		vui32_t	vrb
	)

inlinestatic

Vector Merge Algebraic Low Halfword operation.

Merge only the low halfwords from 8 x Algebraic words across vectors vra and vrb. This is effectively the Vector Merge Odd Halfword operation that is not modified for endian.

For example merge the low 16-bits from each of 8 x 32-bit products as generated by vec_muleuh/vec_mulouh. This result is effectively a vector multiply low unsigned halfword.

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned int.
vrb	128-bit vector unsigned int.

Returns: A vector merge from only the high halfwords of the 8 x Algebraic words across vra and vrb.

◆ vec_mrgeh()

static vui16_t vec_mrgeh	(	vui16_t	vra,
		vui16_t	vrb
	)

inlinestatic

Vector Merge Even Halfwords operation.

Merge the even halfword elements from the concatenation of 2 x vectors (vra and vrb).

Note: The element numbering changes between big and little-endian. So the compiler and this implementation adjusts the generated code to reflect this.

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned short.
vrb	128-bit vector unsigned short.

Returns: A vector merge from only the even halfwords of vra and vrb.

◆ vec_mrgoh()

static vui16_t vec_mrgoh	(	vui16_t	vra,
		vui16_t	vrb
	)

inlinestatic

Vector Merge Odd Halfwords operation.

Merge the odd halfword elements from the concatenation of 2 x vectors (vra and vrb).

Note: The element numbering changes between big and little-endian. So the compiler and this implementation adjusts the generated code to reflect this.

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned short.
vrb	128-bit vector unsigned short.

Returns: A vector merge from only the odd halfwords of vra and vrb.

◆ vec_mulhsh()

static vi16_t vec_mulhsh	(	vi16_t	vra,
		vi16_t	vrb
	)

inlinestatic

Vector Multiply High Signed halfword.

Multiple the corresponding halfword elements of two vector signed short values and return the high order 16-bits, for each 32-bit product element.

processor	Latency	Throughput
power8	9-13	1/cycle
power9	10-14	1/cycle

Parameters

vra	128-bit vector signed short.
vrb	128-bit vector signed short.

Returns: vector of the high order 16-bits of the product of the halfword elements from vra and vrb.

◆ vec_mulhuh()

static vui16_t vec_mulhuh	(	vui16_t	vra,
		vui16_t	vrb
	)

inlinestatic

Vector Multiply High Unsigned halfword.

Multiply the corresponding halfword elements of two vector unsigned short values and return the high order 16-bits, for each 32-bit product element.

processor	Latency	Throughput
power8	9-13	1/cycle
power9	10-14	1/cycle

Parameters

vra	128-bit vector unsigned short.
vrb	128-bit vector unsigned short.

Returns: vector of the high order 16-bits of the product of the halfword elements from vra and vrb.

◆ vec_muluhm()

static vui16_t vec_muluhm	(	vui16_t	vra,
		vui16_t	vrb
	)

inlinestatic

Vector Multiply Unsigned halfword Modulo.

Multiply the corresponding halfword elements of two vector unsigned short values and return the low order 16-bits of the 32-bit product for each element.

Note: vec_muluhm can be used for unsigned or signed short integers. It is the vector equivalent of Multiply Low Halfword.

processor	Latency	Throughput
power8	9-13	1/cycle
power9	10-14	1/cycle

Parameters

vra	128-bit vector unsigned short.
vrb	128-bit vector unsigned short.

Returns: vector of the low order 16-bits of the unsigned product of the halfword elements from vra and vrb.

◆ vec_popcnth()

static vui16_t vec_popcnth ( vui16_t vra )

inlinestatic

Vector Population Count halfword.

Count the number of '1' bits (0-16) within each byte element of a 128-bit vector.

For POWER8 (PowerISA 2.07B) or later use the Vector Population Count Halfword instruction. Otherwise use simple Vector (VMX) instructions to count bits in bytes in parallel.

Note: SIMDized population count inspired by: Warren, Henry S. Jr and Hacker's Delight, 2nd Edition, Addison Wesley, 2013. Chapter 5 Counting Bits, Figure 5-2.

processor	Latency	Throughput
power8	2	2/cycle
power9	3	2/cycle

Parameters

vra	128-bit vector treated as 8 x 16-bit integers (halfword) elements.

Returns: 128-bit vector with the population count for each halfword element.

◆ vec_revbh()

static vui16_t vec_revbh ( vui16_t vra )

inlinestatic

byte reverse each halfword of a vector unsigned short.

For each halfword of the input vector, reverse the order of bytes / octets within the halfword.

processor	Latency	Throughput
power8	2-11	2/cycle
power9	3	2/cycle

Parameters

vra	a 128-bit vector unsigned short.

Returns: a 128-bit vector with the bytes of each halfword reversed.

◆ vec_setb_sh()

static vb16_t vec_setb_sh ( vi16_t vra )

inlinestatic

Vector Set Bool from Signed Halfword.

For each halfword, propagate the sign bit to all 16-bits of that halfword. The result is vector bool short reflecting the sign bit of each 16-bit halfword.

processor	Latency	Throughput
power8	2-4	2/cycle
power9	2-5	2/cycle

Parameters

vra	Vector signed short.

Returns: vector bool short reflecting the sign bit of each halfword.

◆ vec_slhi()

static vui16_t vec_slhi	(	vui16_t	vra,
		const unsigned int	shb
	)

inlinestatic

Vector Shift left Halfword Immediate.

Shift left each halfword element [0-7], 0-15 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-15. A shift count of 0 returns the original value of vra. Shift counts greater then 15 bits return zero.

processor	Latency	Throughput
power8	4-11	2/cycle
power9	5-11	2/cycle

Parameters

vra	a 128-bit vector treated as a vector unsigned short.
shb	Shift amount in the range 0-15.

Returns: 128-bit vector unsigned short, shifted left shb bits.

◆ vec_srahi()

static vi16_t vec_srahi	(	vi16_t	vra,
		const unsigned int	shb
	)

inlinestatic

Vector Shift Right Algebraic Halfword Immediate.

Shift right algebraic each halfword element [0-7], 0-15 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-15. A shift count of 0 returns the original value of vra. Shift counts greater then 7 bits return the sign bit propagated to each bit of each element.

processor	Latency	Throughput
power8	4-11	2/cycle
power9	5-11	2/cycle

Parameters

vra	a 128-bit vector treated as a vector signed char.
shb	Shift amount in the range 0-7.

Returns: 128-bit vector signed short, shifted right shb bits.

◆ vec_srhi()

static vui16_t vec_srhi	(	vui16_t	vra,
		const unsigned int	shb
	)

inlinestatic

Vector Shift Right Halfword Immediate.

Shift right each halfword element [0-7], 0-15 bits, as specified by an immediate value. The shift amount is a const unsigned int in the range 0-15. A shift count of 0 returns the original value of vra. Shift counts greater then 15 bits return zero.

processor	Latency	Throughput
power8	4-11	2/cycle
power9	5-11	2/cycle

Parameters

vra	a 128-bit vector treated as a vector unsigned short.
shb	Shift amount in the range 0-15.

Returns: 128-bit vector unsigned short, shifted right shb bits.

◆ vec_vmaddeuh()

static vui32_t vec_vmaddeuh	(	vui16_t	a,
		vui16_t	b,
		vui16_t	c
	)

inlinestatic

Vector Multiply-Add Even Unsigned Halfwords.

Multiply the even 16-bit Words of vector unsigned short values (a * b) and return sums of the unsigned 32-bit product and the even 16-bit halfwords of c (a_even * b_even) + EXTZ(c_even).

Note: The advantage of this form (versus Multiply-Sum) is that the final 32 bit sums can not overflow.; This implementation is NOT endian sensitive and the function is stable across BE/LE implementations.

processor	Latency	Throughput
power8	9-18	2/cycle
power9	9-16	2/cycle

Parameters

a	128-bit vector unsigned short.
b	128-bit vector unsigned short.
c	128-bit vector unsigned short.

Returns: vector unsigned int sum (a_even * b_even) + EXTZ(c_even).

◆ vec_vmaddouh()

static vui32_t vec_vmaddouh	(	vui16_t	a,
		vui16_t	b,
		vui16_t	c
	)

inlinestatic

Vector Multiply-Add Odd Unsigned Halfwords.

Multiply the odd 16-bit Halfwords of vector unsigned short values (a * b) and return sums of the unsigned 32-bit product and the odd 16-bit halfwords of c (a_odd * b_odd) + EXTZ(c_odd).

Note: The advantage of this form (versus Multiply-Sum) is that the final 32 bit sums can not overflow.; This implementation is NOT endian sensitive and the function is stable across BE/LE implementations.

processor	Latency	Throughput
power8	9-18	2/cycle
power9	9-16	2/cycle

Parameters

a	128-bit vector unsigned short.
b	128-bit vector unsigned short.
c	128-bit vector unsigned short.

Returns: vector unsigned int sum (a_odd * b_odd) + EXTZ(c_odd).

◆ vec_vmrgeh()

static vui16_t vec_vmrgeh	(	vui16_t	vra,
		vui16_t	vrb
	)

inlinestatic

Vector Merge Even Halfwords.

Merge the even halfword elements from the concatenation of 2 x vectors (vra and vrb).

Note

This function implements the operation of a Vector Merge Even Halfword instruction, if the PowerISA included such an instruction. This implementation is NOT endian sensitive and the function is stable across BE/LE implementations. Using big-endian element numbering:

res[0] = vra[0];
res[1] = vrb[0];
res[2] = vra[2];
res[3] = vrb[2];
res[4] = vra[4];
res[5] = vrb[4];
res[6] = vra[6];
res[7] = vrb[6];

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned short.
vrb	128-bit vector unsigned short.

Returns: A vector merge from only the even halfwords of vra and vrb.

◆ vec_vmrgoh()

static vui16_t vec_vmrgoh	(	vui16_t	vra,
		vui16_t	vrb
	)

inlinestatic

Vector Merge Odd Halfwords.

Merge the odd halfword elements from the concatenation of 2 x vectors (vra and vrb).

Note

This function implements the operation of a Vector Merge Odd Halfword instruction, if the PowerISA included such an instruction. This implementation is NOT endian sensitive and the function is stable across BE/LE implementations. Using big-endian element numbering:

res[0] = vra[1];
res[1] = vrb[1];
res[2] = vra[3];
res[3] = vrb[3];
res[4] = vra[5];
res[5] = vrb[5];
res[6] = vra[7];
res[7] = vrb[7];

processor	Latency	Throughput
power8	2-13	2/cycle
power9	3-14	2/cycle

Parameters

vra	128-bit vector unsigned short.
vrb	128-bit vector unsigned short.

Returns: A vector merge from only the odd halfwords of vra and vrb.

Functions

Detailed Description

Recent Additions

Endian problems with halfword operations

Multiply High Unsigned Halfword Example

Examples, Divide by integer constant

Divide by constant 10 examples

Divide by constant 10000 example

Performance data.

More information.

Function Documentation

◆ vec_absduh()

◆ vec_clzh()

◆ vec_ctzh()

◆ vec_mrgahh()

◆ vec_mrgalh()

◆ vec_mrgeh()

◆ vec_mrgoh()

◆ vec_mulhsh()

◆ vec_mulhuh()

◆ vec_muluhm()

◆ vec_popcnth()

◆ vec_revbh()

◆ vec_setb_sh()

◆ vec_slhi()

◆ vec_srahi()

◆ vec_srhi()

◆ vec_vmaddeuh()

◆ vec_vmaddouh()

◆ vec_vmrgeh()

◆ vec_vmrgoh()