Discussion:
[RFC PATCH] Enable V32HI/V64QI const permutations
Jakub Jelinek
2014-10-03 14:39:08 UTC
Hi!

Just to stress the new testcases some more, I've enabled the
vec_perm_const{32hi,64qi} patterns.
Got several ICEs in expand_vec_perm_broadcast_1,
on the final gcc_unreachable () in the function. That function
is only called if the permutation couldn't be broadcast in a single insn,
which I believe must always be possible for TARGET_AVX512BW.
Shall I look at this, or do you plan to address this in the near future?
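
For reference, a reduced example of the kind of constant broadcast
permutation that ends up in expand_vec_perm_broadcast_1 (hypothetical
testcase, not one of the new ones; compile with -mavx512bw):

typedef short v32hi __attribute__ ((vector_size (64)));

v32hi
bcast (v32hi x)
{
  /* All mask elements equal: broadcast element 3 into all 32 words.  */
  return __builtin_shuffle (x, (v32hi) { 3, 3, 3, 3, 3, 3, 3, 3,
                                         3, 3, 3, 3, 3, 3, 3, 3,
                                         3, 3, 3, 3, 3, 3, 3, 3,
                                         3, 3, 3, 3, 3, 3, 3, 3 });
}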

2014-10-03 Jakub Jelinek <***@redhat.com>

* config/i386/sse.md (VEC_PERM_CONST): Add V32HI and V64QI
for TARGET_AVX512BW.

--- gcc/config/i386/sse.md.jj 2014-09-26 10:33:18.000000000 +0200
+++ gcc/config/i386/sse.md 2014-10-03 15:03:44.170446452 +0200
@@ -10386,7 +10386,8 @@ (define_mode_iterator VEC_PERM_CONST
(V8SI "TARGET_AVX") (V4DI "TARGET_AVX")
(V32QI "TARGET_AVX2") (V16HI "TARGET_AVX2")
(V16SI "TARGET_AVX512F") (V8DI "TARGET_AVX512F")
- (V16SF "TARGET_AVX512F") (V8DF "TARGET_AVX512F")])
+ (V16SF "TARGET_AVX512F") (V8DF "TARGET_AVX512F")
+ (V32HI "TARGET_AVX512BW") (V64QI "TARGET_AVX512BW")])

(define_expand "vec_perm_const<mode>"
[(match_operand:VEC_PERM_CONST 0 "register_operand")

Jakub
Jakub Jelinek
2014-10-06 07:08:41 UTC
Post by Jakub Jelinek
Just to stress the new testcases some more, I've enabled the
vec_perm_const{32hi,64qi} patterns.
Got several ICEs in expand_vec_perm_broadcast_1,
on the final gcc_unreachable () in the function. That function
is only called if the permutation couldn't be broadcast in a single insn,
which I believe must always be possible for TARGET_AVX512BW.
Shall I look at this, or do you plan to address this in the near future?
Speaking of -mavx512{bw,vl,f}, there apparently is a full 2-operand shuffle
for V32HI, V16S[IF] and V8D[IF], so the only single-insn full
2-operand shuffle we are missing is V64QI, right?
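
For reference, in intrinsics terms the full 2-operand V32HI shuffle is a
single vpermi2w; a sketch, where idx bits 4:0 pick the word and bit 5
picks between the two sources:

#include <immintrin.h>

__m512i
full_shuffle_v32hi (__m512i a, __m512i idx, __m512i b)
{
  /* One instruction: each of the 32 result words is an arbitrary word
     of a or b, selected by the corresponding word of idx.  */
  return _mm512_permutex2var_epi16 (a, idx, b);
}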

What would be the best worst-case sequence for that?

I'd think 2x vpermi2w, 2x vpshufb and one vpor could achieve that:
the first vpermi2w would put the even bytes into the right word positions
(i.e. at the right position or one above it), the second vpermi2w would put
the odd bytes into the right word positions (i.e. at the right position
or one below it), each vpshufb would swap the byte pairs where necessary
and zero out the other (odd or even) byte,
and vpor would merge the results. Do you have something better?
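
To spell out the index arithmetic (a scalar model of the one operand
case for brevity, the two operand case just has 128 source bytes; the
function name is illustrative):

void
model_v64qi_perm (const unsigned char *src, const unsigned char *perm,
                  unsigned char *dst)
{
  unsigned char even[64], odd[64];
  int i;

  /* vpermi2w works at word granularity, so byte pairs move together:
     even[] gets the word containing each even output byte, odd[] the
     word containing each odd output byte.  */
  for (i = 0; i < 64; i += 2)
    {
      int w0 = perm[i] / 2, w1 = perm[i + 1] / 2;
      even[i] = src[2 * w0];
      even[i + 1] = src[2 * w0 + 1];
      odd[i] = src[2 * w1];
      odd[i + 1] = src[2 * w1 + 1];
    }

  /* Each vpshufb picks the wanted byte of each aligned pair (both
     bytes of a pair sit in the same 16-byte lane) and zeroes the
     other byte; vpor then merges the two halves.  */
  for (i = 0; i < 64; i += 2)
    {
      dst[i] = even[i + (perm[i] & 1)];
      dst[i + 1] = odd[i + (perm[i + 1] & 1)];
    }
}
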
What about an arbitrary one-operand V64QI const permutation?

Jakub
Ilya Tocar
2014-10-06 14:09:07 UTC
Post by Jakub Jelinek
Post by Jakub Jelinek
Just to stress the new testcases some more, I've enabled the
vec_perm_const{32hi,64qi} patterns.
Got several ICEs in expand_vec_perm_broadcast_1,
on the final gcc_unreachable () in the function. That function
is only called if the permutation couldn't be broadcast in a single insn,
which I believe must always be possible for TARGET_AVX512BW.
Shall I look at this, or do you plan to address this in the near future?
Speaking of -mavx512{bw,vl,f}, there apparently is a full 2-operand shuffle
for V32HI, V16S[IF] and V8D[IF], so the only single-insn full
2-operand shuffle we are missing is V64QI, right?
What would be the best worst-case sequence for that?
I'd think 2x vpermi2w, 2x vpshufb and one vpor could achieve that:
the first vpermi2w would put the even bytes into the right word positions
(i.e. at the right position or one above it), the second vpermi2w would put
the odd bytes into the right word positions (i.e. at the right position
or one below it),
I think we will also need to spend insns converting the byte-sized mask
into a word-sized mask.
Post by Jakub Jelinek
each vpshufb would swap the byte pairs where necessary
and zero out the other (odd or even) byte,
This will probably also require vpshufb mask preparation (setting the
high bit for zeroing).
Post by Jakub Jelinek
and vpor would merge the results. Do you have something better?
Currently (on the branch) it's implemented as 2x vpermi2w + 4x shift +
blend: 3 shifts to prepare the masks for vpermi2w,
2 vpermi2w to put the odd/even bytes into the low part of the right
positions, a shift to move the low part into the high part, and finally
a blend with a 101010... mask to get the result.
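
For concreteness, one way to write that idea in intrinsics (an
illustrative sketch, not exactly the branch code, and not optimal in
insn count; sel holds byte indices 0..127 into the concatenation of
a and b):

#include <immintrin.h>

static __m512i
shuffle_v64qi (__m512i a, __m512i b, __m512i sel)
{
  const __m512i one = _mm512_set1_epi16 (1);
  /* Word indices of the words containing the even resp. odd output
     bytes: byte k of a||b lives in word k >> 1.  */
  __m512i idx_even = _mm512_srli_epi16 (_mm512_slli_epi16 (sel, 8), 9);
  __m512i idx_odd = _mm512_srli_epi16 (sel, 9);
  /* Per-word shift amounts: 8 iff the wanted byte is the high byte.  */
  __m512i amt_even = _mm512_slli_epi16 (_mm512_and_si512 (sel, one), 3);
  __m512i amt_odd
    = _mm512_slli_epi16 (_mm512_and_si512 (_mm512_srli_epi16 (sel, 8),
                                           one), 3);
  /* 2x vpermi2w: fetch the word containing each wanted byte.  */
  __m512i even = _mm512_permutex2var_epi16 (a, idx_even, b);
  __m512i odd = _mm512_permutex2var_epi16 (a, idx_odd, b);
  /* Move each wanted byte into the low byte of its word, shift the
     odd results up into the high bytes and blend with a 1010... mask.  */
  even = _mm512_srlv_epi16 (even, amt_even);
  odd = _mm512_slli_epi16 (_mm512_srlv_epi16 (odd, amt_odd), 8);
  return _mm512_mask_blend_epi8 ((__mmask64) 0xAAAAAAAAAAAAAAAAULL,
                                 even, odd);
}
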
Post by Jakub Jelinek
What about an arbitrary one-operand V64QI const permutation?
Currently it loads the constant vector into a register and uses the same
code path as the non-constant version (this can probably be improved).
Jakub Jelinek
2014-10-06 15:32:18 UTC
Post by Ilya Tocar
Post by Jakub Jelinek
Speaking of -mavx512{bw,vl,f}, there apparently is a full 2-operand shuffle
for V32HI, V16S[IF] and V8D[IF], so the only single-insn full
2-operand shuffle we are missing is V64QI, right?
What would be the best worst-case sequence for that?
I'd think 2x vpermi2w, 2x vpshufb and one vpor could achieve that:
the first vpermi2w would put the even bytes into the right word positions
(i.e. at the right position or one above it), the second vpermi2w would put
the odd bytes into the right word positions (i.e. at the right position
or one below it),
I think we will also need to spend insns converting the byte-sized mask
into a word-sized mask.
I'm talking about the constant permutations here (see my other mail to
Kirill). In that case, you can tweak the mask as much as you want.

I mean something like this (completely untested, it would need a separate function):

  if (TARGET_AVX512BW && d->vmode == V64QImode)
    ;
  else
    return false;

  /* We can emit arbitrary two operand V64QImode permutations
     with 2 vpermi2w, 2 vpshufb and one vpor instruction.  */
  if (d->testing_p)
    return true;

  struct expand_vec_perm_d ds[2];
  rtx rperm[128], vperm, target0, target1;
  unsigned int i, nelt = d->nelt;

  for (i = 0; i < 2; i++)
    {
      ds[i] = *d;
      ds[i].vmode = V32HImode;
      ds[i].nelt = 32;
      ds[i].target = gen_reg_rtx (V32HImode);
      ds[i].op0 = gen_lowpart (V32HImode, d->op0);
      ds[i].op1 = gen_lowpart (V32HImode, d->op1);
    }

  /* Prepare permutations such that the first one takes care of
     putting the even bytes into the right positions or one higher
     positions (ds[0]) and the second one takes care of
     putting the odd bytes into the right positions or one below
     (ds[1]).  */
  for (i = 0; i < nelt; i++)
    {
      ds[i & 1].perm[i / 2] = d->perm[i] / 2;
      if (i & 1)
	{
	  rperm[i] = constm1_rtx;
	  /* After the vpermi2w the wanted byte sits at offset
	     (d->perm[i] & 1) within word i / 2 of ds[1].target.  */
	  rperm[i + 64] = GEN_INT ((i & 14) + (d->perm[i] & 1));
	}
      else
	{
	  rperm[i] = GEN_INT ((i & 14) + (d->perm[i] & 1));
	  rperm[i + 64] = constm1_rtx;
	}
    }

  bool ok = expand_vec_perm_1 (&ds[0]);
  gcc_assert (ok);
  ds[0].target = gen_lowpart (V64QImode, ds[0].target);

  ok = expand_vec_perm_1 (&ds[1]);
  gcc_assert (ok);
  ds[1].target = gen_lowpart (V64QImode, ds[1].target);

  vperm = gen_rtx_CONST_VECTOR (V64QImode, gen_rtvec_v (64, rperm));
  vperm = force_reg (V64QImode, vperm);
  target0 = gen_reg_rtx (V64QImode);
  emit_insn (gen_avx512bw_pshufbv64qi3 (target0, ds[0].target, vperm));

  vperm = gen_rtx_CONST_VECTOR (V64QImode, gen_rtvec_v (64, rperm + 64));
  vperm = force_reg (V64QImode, vperm);
  target1 = gen_reg_rtx (V64QImode);
  emit_insn (gen_avx512bw_pshufbv64qi3 (target1, ds[1].target, vperm));

  emit_insn (gen_iorv64qi3 (d->target, target0, target1));
  return true;

Jakub

Kirill Yukhin
2014-10-06 12:46:29 UTC
Hello Jakub,
Post by Jakub Jelinek
--- gcc/config/i386/sse.md.jj 2014-09-26 10:33:18.000000000 +0200
+++ gcc/config/i386/sse.md 2014-10-03 15:03:44.170446452 +0200
@@ -10386,7 +10386,8 @@ (define_mode_iterator VEC_PERM_CONST
(V8SI "TARGET_AVX") (V4DI "TARGET_AVX")
(V32QI "TARGET_AVX2") (V16HI "TARGET_AVX2")
(V16SI "TARGET_AVX512F") (V8DI "TARGET_AVX512F")
- (V16SF "TARGET_AVX512F") (V8DF "TARGET_AVX512F")])
+ (V16SF "TARGET_AVX512F") (V8DF "TARGET_AVX512F")
+ (V32HI "TARGET_AVX512BW") (V64QI "TARGET_AVX512BW")])
This is the subject of the 63/n patch (I will post it tomorrow).

--
Thanks, K