[PATCH 0/14+2][Vectorizer] Made reductions endianness-neutral, fixes PR/61114
Alan Lawrence
2014-09-18 11:41:06 UTC
The end goal here is to remove this code from tree-vect-loop.c
(vect_create_epilog_for_reduction):

  if (BYTES_BIG_ENDIAN)
    bitpos = size_binop (MULT_EXPR,
                         bitsize_int (TYPE_VECTOR_SUBPARTS (vectype) - 1),
                         TYPE_SIZE (scalar_type));
  else
    ...

as this is the root cause of PR/61114 (see testcase there, failing on all
bigendian targets supporting reduc_[us]plus_optab). Quoting Richard Biener, "all
code conditional on BYTES/WORDS_BIG_ENDIAN in tree-vect* is suspicious". The
code snippet above is used on two paths:

(Path 1) (patches 1-6) Reductions using REDUC_(PLUS|MIN|MAX)_EXPR =
reduc_[us](plus|min|max)_optab.
The optab is documented as "the scalar result is stored in the least significant
bits of operand 0", but the tree code as "the first element in the vector
holding the result of the reduction of all elements of the operand". This
mismatch means that when the tree code is folded, the code snippet above reads
the result from the wrong end of the vector.

The strategy (as per https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html)
is to define new tree codes and optabs that produce scalar results directly;
this seems better than tying (the element of the vector into which the result is
placed) to (the endianness of the target), and avoids generating extra moves on
current bigendian targets. However, the previous optabs are retained for now as
a migration strategy so as not to break existing backends; moving individual
platforms over will follow.

A complication here is on AArch64, where we directly generate REDUC_PLUS_EXPRs
from intrinsics in gimple_fold_builtin; I temporarily remove this folding in
order to decouple the midend and AArch64 backend.

(Path 2) (patches 7-13) Reductions using whole-vector-shifts, i.e.
VEC_RSHIFT_EXPR and vec_shr_optab. Here the tree code as well as the optab is
defined in an endianness-dependent way, leading to significant complication in
fold-const.c. (Moreover, the "equivalent" vec_shl_optab is never used!). Few
platforms appear to handle vec_shr_optab (and fewer bigendian - I see only
PowerPC and MIPS), so it seems pertinent to change the existing optab to be
endianness-neutral.

Patch 10 defines vec_shr for AArch64, for the old specification; patch 13
updates that implementation to fit the new endianness-neutral specification,
serving as a guide for other existing backends. Patches/RFCs 15 and 16 are
equivalents for MIPS and PowerPC; I haven't tested these but hope they act as
useful pointers for the port maintainers.

Finally patch 14 cleans up the affected part of tree-vect-loop.c
(vect_create_epilog_for_reduction).

--Alan
Alan Lawrence
2014-09-18 11:45:34 UTC
The gimple folding ties the AArch64 backend to the tree representation of the
midend via the neon intrinsics. This code enables constant folding of Neon
intrinsics reduction ops, so improves performance, but is not necessary for
correctness. By temporarily removing it (here), we can then change the midend
representation independently of the AArch64 backend + intrinsics.

However, I'm leaving the code in place, as a later patch will bring it all back
in a very similar form (but enabled for bigendian).

Bootstrapped on aarch64-none-linux; tested aarch64.exp on aarch64-none-elf and
aarch64_be-none-elf. (The removed code was already disabled for bigendian; and
this is solely a __builtin-folding mechanism, i.e. used only for Neon/ACLE
intrinsics.)

gcc/ChangeLog:
* config/aarch64/aarch64.c (TARGET_GIMPLE_FOLD_BUILTIN): Comment out.
* config/aarch64/aarch64-builtins.c (aarch64_gimple_fold_builtin):
Remove using preprocessor directives.
Marcus Shawcroft
2014-09-24 09:41:51 UTC
Post by Alan Lawrence
The gimple folding ties the AArch64 backend to the tree representation of
the midend via the neon intrinsics. This code enables constant folding of
Neon intrinsics reduction ops, so improves performance, but is not necessary
for correctness. By temporarily removing it (here), we can then change the
midend representation independently of the AArch64 backend + intrinsics.
However, I'm leaving the code in place, as a later patch will bring it all
back in a very similar form (but enabled for bigendian).
Bootstrapped on aarch64-none-linux; tested aarch64.exp on aarch64-none-elf
and aarch64_be-none-elf. (The removed code was already disabled for
bigendian; and this is solely a __builtin-folding mechanism, i.e. used only
for Neon/ACLE intrinsics.)
* config/aarch64/aarch64.c (TARGET_GIMPLE_FOLD_BUILTIN): Comment out.
Remove using preprocessor directives.
OK /Marcus
Alan Lawrence
2014-09-18 11:50:52 UTC
This fixes PR/61114 by redefining the REDUC_{MIN,MAX,PLUS}_EXPR tree codes.

These are presently documented as producing a vector with the result in element
0, and this is inconsistent with their use in tree-vect-loop.c (which on
bigendian targets pulls the bits out of the wrong end of the vector result).
This leads to bugs on bigendian targets - see also
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114.

I discounted "fixing" the vectorizer (to read from element 0) and then making
bigendian targets (whose architectural insn produces the result in lane N-1)
permute the result vector, as optimization of vectors in RTL seems unlikely to
remove such a permute and would lead to a performance regression.

Instead it seems more natural for the tree code to produce a scalar result
(producing a vector with the result in lane 0 has already caused confusion, e.g.
https://gcc.gnu.org/ml/gcc-patches/2012-10/msg01100.html).

However, this patch preserves the meaning of the optab (producing a result in
lane 0 on little-endian architectures or N-1 on bigendian), thus generally
avoiding the need to change backends. Thus, expr.c extracts an
endianness-dependent element from the optab result to give the result expected
for the tree code.

Previously posted as an RFC
https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html , now with an extra
VIEW_CONVERT_EXPR if the types of the reduction/result do not match.

Testing:
x86_64-none-linux-gnu: bootstrap, check-gcc, check-g++
aarch64-none-linux-gnu: bootstrap
aarch64-none-elf: check-gcc, check-g++
arm-none-eabi: check-gcc

aarch64_be-none-elf: check-gcc, showing
FAIL->PASS: gcc.dg/vect/no-scevccp-outer-7.c execution test
FAIL->PASS: gcc.dg/vect/no-scevccp-outer-13.c execution test
Passes the (previously-failing) reduced testcase on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114

I have also assembler/stage-1 tested that testcase on PowerPC; it is fixed there too.

gcc/ChangeLog:

* expr.c (expand_expr_real_2): For REDUC_{MIN,MAX,PLUS}_EXPR, add
extract_bit_field around optab result.

* fold-const.c (fold_unary_loc): For REDUC_{MIN,MAX,PLUS}_EXPR, produce
scalar not vector.

* tree-cfg.c (verify_gimple_assign_unary): Check result vs operand type
for REDUC_{MIN,MAX,PLUS}_EXPR.

* tree-vect-loop.c (vect_analyze_loop): Update comment.
(vect_create_epilog_for_reduction): For direct vector reduction, use
result of tree code directly without extract_bit_field.

* tree.def (REDUC_MAX_EXPR, REDUC_MIN_EXPR, REDUC_PLUS_EXPR): Update
comment.
Richard Biener
2014-09-22 10:34:12 UTC
Post by Alan Lawrence
This fixes PR/61114 by redefining the REDUC_{MIN,MAX,PLUS}_EXPR tree codes.
These are presently documented as producing a vector with the result in
element 0, and this is inconsistent with their use in tree-vect-loop.c
(which on bigendian targets pulls the bits out of the wrong end of the
vector result). This leads to bugs on bigendian targets - see also
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114.
I discounted "fixing" the vectorizer (to read from element 0) and then
making bigendian targets (whose architectural insn produces the result in
lane N-1) permute the result vector, as optimization of vectors in RTL seems
unlikely to remove such a permute and would lead to a performance
regression.
Instead it seems more natural for the tree code to produce a scalar result
(producing a vector with the result in lane 0 has already caused confusion,
e.g. https://gcc.gnu.org/ml/gcc-patches/2012-10/msg01100.html).
However, this patch preserves the meaning of the optab (producing a result
in lane 0 on little-endian architectures or N-1 on bigendian), thus
generally avoiding the need to change backends. Thus, expr.c extracts an
endianness-dependent element from the optab result to give the result
expected for the tree code.
Previously posted as an RFC
https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html , now with an extra
VIEW_CONVERT_EXPR if the types of the reduction/result do not match.
Huh. Does that ever happen? Please use a NOP_EXPR instead of
a VIEW_CONVERT_EXPR.

Ok with that change.

Thanks,
Richard.
Post by Alan Lawrence
x86_86-none-linux-gnu: bootstrap, check-gcc, check-g++
aarch64-none-linux-gnu: bootstrap
aarch64-none-elf: check-gcc, check-g++
arm-none-eabi: check-gcc
aarch64_be-none-elf: check-gcc, showing
FAIL->PASS: gcc.dg/vect/no-scevccp-outer-7.c execution test
FAIL->PASS: gcc.dg/vect/no-scevccp-outer-13.c execution test
Passes the (previously-failing) reduced testcase on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114
Have also assembler/stage-1 tested that testcase on PowerPC, also fixed.
* expr.c (expand_expr_real_2): For REDUC_{MIN,MAX,PLUS}_EXPR, add
extract_bit_field around optab result.
* fold-const.c (fold_unary_loc): For REDUC_{MIN,MAX,PLUS}_EXPR, produce
scalar not vector.
* tree-cfg.c (verify_gimple_assign_unary): Check result vs operand type
for REDUC_{MIN,MAX,PLUS}_EXPR.
* tree-vect-loop.c (vect_analyze_loop): Update comment.
(vect_create_epilog_for_reduction): For direct vector reduction, use
result of tree code directly without extract_bit_field.
* tree.def (REDUC_MAX_EXPR, REDUC_MIN_EXPR, REDUC_PLUS_EXPR): Update
comment.
Alan Lawrence
2014-09-22 13:23:02 UTC
Post by Richard Biener
Huh. Does that ever happen? Please use a NOP_EXPR instead of
a VIEW_CONVERT_EXPR.
Yes, the testcase is gcc.target/i386/pr51235.c, which performs black magic***
with void *. (This testcase otherwise fails the verify_gimple_assign_unary check
in tree-cfg.c.) However, the test also passes with your suggestion of NOP_EXPR,
so that's good by me.

***that is, computes the minimum

--Alan
Post by Richard Biener
Ok with that change.
Thanks,
Richard.
Post by Alan Lawrence
x86_86-none-linux-gnu: bootstrap, check-gcc, check-g++
aarch64-none-linux-gnu: bootstrap
aarch64-none-elf: check-gcc, check-g++
arm-none-eabi: check-gcc
aarch64_be-none-elf: check-gcc, showing
FAIL->PASS: gcc.dg/vect/no-scevccp-outer-7.c execution test
FAIL->PASS: gcc.dg/vect/no-scevccp-outer-13.c execution test
Passes the (previously-failing) reduced testcase on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114
Have also assembler/stage-1 tested that testcase on PowerPC, also fixed.
* expr.c (expand_expr_real_2): For REDUC_{MIN,MAX,PLUS}_EXPR, add
extract_bit_field around optab result.
* fold-const.c (fold_unary_loc): For REDUC_{MIN,MAX,PLUS}_EXPR, produce
scalar not vector.
* tree-cfg.c (verify_gimple_assign_unary): Check result vs operand type
for REDUC_{MIN,MAX,PLUS}_EXPR.
* tree-vect-loop.c (vect_analyze_loop): Update comment.
(vect_create_epilog_for_reduction): For direct vector reduction, use
result of tree code directly without extract_bit_field.
* tree.def (REDUC_MAX_EXPR, REDUC_MIN_EXPR, REDUC_PLUS_EXPR): Update
comment.
Alan Lawrence
2014-09-24 15:02:11 UTC
So it looks like patches 1-6 (reduc_foo) are relatively close to final, and
given that these fix PR/61114, I'm going to try to land them while working on a
respin of the second half (vec_shr)... (summary: yes, I like the vec_perm idea
too, but the devil is in the detail!)

However my CompileFarm account is still pending, so to that end, if you were
able to test patch 2/14 (attached inc. Richie's s/VIEW_CONVERT_EXPR/NOP_EXPR/)
on the CompileFarm PowerPC machine, that'd be great, many thanks indeed. It
should apply on its own without patch 1. I'll aim to get an alternative patch 3
back to the list shortly, and follow up with .md updates to the various backends.

Cheers, Alan
Post by Richard Biener
Post by Alan Lawrence
This fixes PR/61114 by redefining the REDUC_{MIN,MAX,PLUS}_EXPR tree codes.
These are presently documented as producing a vector with the result in
element 0, and this is inconsistent with their use in tree-vect-loop.c
(which on bigendian targets pulls the bits out of the wrong end of the
vector result). This leads to bugs on bigendian targets - see also
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114.
I discounted "fixing" the vectorizer (to read from element 0) and then
making bigendian targets (whose architectural insn produces the result in
lane N-1) permute the result vector, as optimization of vectors in RTL seems
unlikely to remove such a permute and would lead to a performance
regression.
Instead it seems more natural for the tree code to produce a scalar result
(producing a vector with the result in lane 0 has already caused confusion,
e.g. https://gcc.gnu.org/ml/gcc-patches/2012-10/msg01100.html).
However, this patch preserves the meaning of the optab (producing a result
in lane 0 on little-endian architectures or N-1 on bigendian), thus
generally avoiding the need to change backends. Thus, expr.c extracts an
endianness-dependent element from the optab result to give the result
expected for the tree code.
Previously posted as an RFC
https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html , now with an extra
VIEW_CONVERT_EXPR if the types of the reduction/result do not match.
Huh. Does that ever happen? Please use a NOP_EXPR instead of
a VIEW_CONVERT_EXPR.
Ok with that change.
Thanks,
Richard.
Post by Alan Lawrence
x86_86-none-linux-gnu: bootstrap, check-gcc, check-g++
aarch64-none-linux-gnu: bootstrap
aarch64-none-elf: check-gcc, check-g++
arm-none-eabi: check-gcc
aarch64_be-none-elf: check-gcc, showing
FAIL->PASS: gcc.dg/vect/no-scevccp-outer-7.c execution test
FAIL->PASS: gcc.dg/vect/no-scevccp-outer-13.c execution test
Passes the (previously-failing) reduced testcase on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114
Have also assembler/stage-1 tested that testcase on PowerPC, also fixed.
* expr.c (expand_expr_real_2): For REDUC_{MIN,MAX,PLUS}_EXPR, add
extract_bit_field around optab result.
* fold-const.c (fold_unary_loc): For REDUC_{MIN,MAX,PLUS}_EXPR, produce
scalar not vector.
* tree-cfg.c (verify_gimple_assign_unary): Check result vs operand type
for REDUC_{MIN,MAX,PLUS}_EXPR.
* tree-vect-loop.c (vect_analyze_loop): Update comment.
(vect_create_epilog_for_reduction): For direct vector reduction, use
result of tree code directly without extract_bit_field.
* tree.def (REDUC_MAX_EXPR, REDUC_MIN_EXPR, REDUC_PLUS_EXPR): Update
comment.
Segher Boessenkool
2014-09-24 18:08:13 UTC
Post by Alan Lawrence
However my CompileFarm account is still pending, so to that end, if you
were able to test patch 2/14 (attached inc. Richie's
s/VIEW_CONVERT_EXPR/NOP_EXPR/) on the CompileFarm PowerPC machine, that'd
be great, many thanks indeed. It should apply on its own without patch 1.
Patch 2/14 on its own has no regressions on gcc110 (powerpc64-linux,
c,c++,fortran, -m64,-m32,-m32/-mpowerpc64,-m64/-mlra).

Cheers,


Segher
Alan Lawrence
2014-09-25 16:07:16 UTC
Many thanks indeed! :)

--Alan
Post by Segher Boessenkool
Post by Alan Lawrence
However my CompileFarm account is still pending, so to that end, if you
were able to test patch 2/14 (attached inc. Richie's
s/VIEW_CONVERT_EXPR/NOP_EXPR/) on the CompileFarm PowerPC machine, that'd
be great, many thanks indeed. It should apply on its own without patch 1.
Patch 2/14 on its own has no regressions on gcc110 (powerpc64-linux,
c,c++,fortran, -m64,-m32,-m32/-mpowerpc64,-m64/-mlra).
Cheers,
Segher
Alan Lawrence
2014-09-18 11:54:40 UTC
These match their corresponding tree codes, by taking a vector and returning a
scalar; this is more architecturally neutral than the (somewhat loosely defined)
previous optab that took a vector and returned a vector with the result in the
least significant bits (i.e. element 0 for little-endian or N-1 for bigendian).
However, the old optabs are preserved so as not to break existing backends, so
clients check for both old + new optabs.

Bootstrap, check-gcc and check-g++ on x86_64-none-linux-gnu.
aarch64.exp + vect.exp on aarch64{,_be}-none-elf.
(of course at this point in the series all these are using the old optab +
migration path.)

gcc/ChangeLog:

* doc/md.texi (Standard Names): Add reduc_(plus,[us](min|max))_scal
optabs, and note in reduc_[us](plus|min|max) to prefer the former.

* expr.c (expand_expr_real_2): Use reduc_..._scal if available, fall
back to old reduc_... + BIT_FIELD_REF only if not.

* optabs.c (optab_for_tree_code): for REDUC_(MAX,MIN,PLUS)_EXPR,
return the reduce-to-scalar (reduc_..._scal) optab.
(scalar_reduc_to_vector): New.

* optabs.def (reduc_smax_scal_optab, reduc_smin_scal_optab,
reduc_plus_scal_optab, reduc_umax_scal_optab, reduc_umin_scal_optab):
New.

* optabs.h (scalar_reduc_to_vector): Declare.

* tree-vect-loop.c (vectorizable_reduction): Look for optabs reducing
to either scalar or vector.
Richard Biener
2014-09-22 10:40:21 UTC
Post by Alan Lawrence
These match their corresponding tree codes, by taking a vector and returning
a scalar; this is more architecturally neutral than the (somewhat loosely
defined) previous optab that took a vector and returned a vector with the
result in the least significant bits (i.e. element 0 for little-endian or
N-1 for bigendian). However, the old optabs are preserved so as not to break
existing backends, so clients check for both old + new optabs.
Bootstrap, check-gcc and check-g++ on x86_64-none-linux-gnu.
aarch64.exp + vect.exp on aarch64{,_be}-none-elf.
(of course at this point in the series all these are using the old optab +
migration path.)
scalar_reduc_to_vector misses a comment.

I wonder if at the end we wouldn't transition all backends and then
renaming reduc_*_scal_optab back to reduc_*_optab makes sense.

The optabs have only one mode - I wouldn't be surprised if an ISA
invents for example v4si -> di reduction? So do we want to make
reduc_plus_scal_optab a little bit more future proof (maybe there
is already an ISA that supports this kind of reduction?).

Otherwise the patch looks good to me.

Thanks,
Richard.
Post by Alan Lawrence
* doc/md.texi (Standard Names): Add reduc_(plus,[us](min|max))|scal
optabs, and note in reduc_[us](plus|min|max) to prefer the former.
* expr.c (expand_expr_real_2): Use reduc_..._scal if available, fall
back to old reduc_... + BIT_FIELD_REF only if not.
* optabs.c (optab_for_tree_code): for REDUC_(MAX,MIN,PLUS)_EXPR,
return the reduce-to-scalar (reduc_..._scal) optab.
(scalar_reduc_to_vector): New.
* optabs.def (reduc_smax_scal_optab, reduc_smin_scal_optab,
reduc_plus_scal_optab, reduc_umax_scal_optab,
New.
* optabs.h (scalar_reduc_to_vector): Declare.
* tree-vect-loop.c (vectorizable_reduction): Look for optabs reducing
to either scalar or vector.
Alan Lawrence
2014-09-22 13:26:17 UTC
Post by Richard Biener
scalar_reduc_to_vector misses a comment.
Ok to reuse the comment in optabs.h in optabs.c also?
Post by Richard Biener
I wonder if at the end we wouldn't transition all backends and then
renaming reduc_*_scal_optab back to reduc_*_optab makes sense.
Yes, that sounds like a plan, the _scal is a bit of a mouthful.
Post by Richard Biener
The optabs have only one mode - I wouldn't be surprised if an ISA
invents for example v4si -> di reduction? So do we want to make
reduc_plus_scal_optab a little bit more future proof (maybe there
is already an ISA that supports this kind of reduction?).
That sounds like a plausible thing for an ISA to do, indeed. However given these
names are only used by the autovectorizer rather than directly, the question is
what the corresponding source code looks like, and/or what changes to the
autovectorizer we might have to make to (look for code to) exploit such an
instruction. At this point I could go for a
reduc_{plus,min_max}_scal_<mode><mode> which reduces from the first vector mode
to the second scalar mode, and then make the vectorizer look only for cases
where the second mode was the element type of the first; but I'm not sure I want
to do anything more complicated than that at this stage. (However, indeed it
would leave the possibility open for the future.)

--Alan
Richard Biener
2014-09-22 13:38:13 UTC
Post by Alan Lawrence
Post by Richard Biener
scalar_reduc_to_vector misses a comment.
Ok to reuse the comment in optabs.h in optabs.c also?
Sure.
Post by Alan Lawrence
Post by Richard Biener
I wonder if at the end we wouldn't transition all backends and then
renaming reduc_*_scal_optab back to reduc_*_optab makes sense.
Yes, that sounds like a plan, the _scal is a bit of a mouthful.
Post by Richard Biener
The optabs have only one mode - I wouldn't be surprised if an ISA
invents for example v4si -> di reduction? So do we want to make
reduc_plus_scal_optab a little bit more future proof (maybe there
is already an ISA that supports this kind of reduction?).
That sounds like a plausible thing for an ISA to do, indeed. However given
these names are only used by the autovectorizer rather than directly, the
question is what the corresponding source code looks like, and/or what
changes to the autovectorizer we might have to make to (look for code to)
exploit such an instruction.
Ah, indeed. Would be sth like a REDUC_WIDEN_SUM_EXPR or so.
Post by Alan Lawrence
At this point I could go for a
reduc_{plus,min_max}_scal_<mode><mode> which reduces from the first vector
mode to the second scalar mode, and then make the vectorizer look only for
cases where the second mode was the element type of the first; but I'm not
sure I want to do anything more complicated than that at this stage.
(However, indeed it would leave the possibility open for the future.)
Yeah, agreed. For the min/max case a widen variant isn't useful anyway.

Thanks,
Richard.
Post by Alan Lawrence
--Alan
Alan Lawrence
2014-09-25 14:32:49 UTC
Ok, so, I've tried making reduc_plus optab take two modes: that of the vector to
reduce, and the result; thus allowing platforms to provide a widening reduction.
However, I'm keeping reduc_[us](min|max)_optab with only a single mode, as
widening makes no sense there.

I've not gone as far as making the vectorizer use any such widening reduction,
however: as previously stated, I'm not really sure what the input source code
for that even looks like (maybe in a language other than C?). If we wanted to do
a non-widening reduction using such an instruction (by discarding the extra
bits), it strikes me the platform can/should provide a non-widening optab for
that case...

Testing: bootstrapped on x86_64 linux + check-gcc; cross-tested aarch64-none-elf
check-gcc; cross-tested aarch64_be-none-elf aarch64.exp + vect.exp.

So, my feeling is that the extra complexity here doesn't really buy us anything;
and that if we do want to support / use widening reductions in the future, we
should do so with a separate, reduc_plus_widen... optab, and stick with the
original patch/formulation for now. (In other words: this patch is a guide to
how I think a dual-mode reduc_plus_optab looks, but I don't honestly like it!).

If you agree, I shall transplant the comments on scalar_reduc_to_vector from
this patch into the original, and then post that revised version?


Cheers, Alan
Post by Richard Biener
Post by Alan Lawrence
Post by Richard Biener
scalar_reduc_to_vector misses a comment.
Ok to reuse the comment in optabs.h in optabs.c also?
Sure.
Post by Alan Lawrence
Post by Richard Biener
I wonder if at the end we wouldn't transition all backends and then
renaming reduc_*_scal_optab back to reduc_*_optab makes sense.
Yes, that sounds like a plan, the _scal is a bit of a mouthful.
Post by Richard Biener
The optabs have only one mode - I wouldn't be surprised if an ISA
invents for example v4si -> di reduction? So do we want to make
reduc_plus_scal_optab a little bit more future proof (maybe there
is already an ISA that supports this kind of reduction?).
That sounds like a plausible thing for an ISA to do, indeed. However given
these names are only used by the autovectorizer rather than directly, the
question is what the corresponding source code looks like, and/or what
changes to the autovectorizer we might have to make to (look for code to)
exploit such an instruction.
Ah, indeed. Would be sth like a REDUC_WIDEN_SUM_EXPR or so.
Post by Alan Lawrence
At this point I could go for a
reduc_{plus,min_max}_scal_<mode><mode> which reduces from the first vector
mode to the second scalar mode, and then make the vectorizer look only for
cases where the second mode was the element type of the first; but I'm not
sure I want to do anything more complicated than that at this stage.
(However, indeed it would leave the possibility open for the future.)
Yeah, agreed. For the min/max case a widen variant isn't useful anyway.
Thanks,
Richard.
Post by Alan Lawrence
--Alan
Richard Biener
2014-09-25 15:31:47 UTC
Post by Alan Lawrence
Ok, so, I've tried making reduc_plus optab take two modes: that of the
vector to reduce, and the result; thus allowing platforms to provide a
widening reduction. However, I'm keeping reduc_[us](min|max)_optab with only
a single mode, as widening makes no sense there.
I've not gone as far as making the vectorizer use any such a widening
reduction, however: as previously stated, I'm not really sure what the input
source code for that even looks like (maybe in a language other than C?). If
we wanted to do a non-widening reduction using such an instruction (by
discarding the extra bits), strikes me the platform can/should provide a
non-widening optab for that case...
I expect it to apply to sth like

int foo (char *in, int n)
{
  int res = 0;
  for (int i = 0; i < n; ++i)
    res += *in;
  return res;
}

where you'd see

  temc = *in;
  tem = (int) temc;
  res += tem;

we probably handle this by widening the chars to ints and unrolling
the loop enough to make that work (thus for n == 16 it would maybe
fail to vectorize?). It should be more efficient to pattern-detect
this as a widening reduction.
Post by Alan Lawrence
Testing: bootstrapped on x86_64 linux + check-gcc; cross-tested
aarch64-none-elf check-gcc; cross-tested aarch64_be-none-elf aarch64.exp +
vect.exp.
So, my feeling is that the extra complexity here doesn't really buy us
anything; and that if we do want to support / use widening reductions in the
future, we should do so with a separate, reduc_plus_widen... optab, and
stick with the original patch/formulation for now. (In other words: this
patch is a guide to how I think a dual-mode reduc_plus_optab looks, but I
don't honestly like it!).
If you agree, I shall transplant the comments on scalar_reduc_to_vector from
this patch into the original, and then post that revised version?
I agree. We can come back once a target implements such widening
reduction.

Richard.
Post by Alan Lawrence
Cheers, Alan
Post by Richard Biener
Post by Alan Lawrence
Post by Richard Biener
scalar_reduc_to_vector misses a comment.
Ok to reuse the comment in optabs.h in optabs.c also?
Sure.
Post by Alan Lawrence
Post by Richard Biener
I wonder if at the end we wouldn't transition all backends and then
renaming reduc_*_scal_optab back to reduc_*_optab makes sense.
Yes, that sounds like a plan, the _scal is a bit of a mouthful.
Post by Richard Biener
The optabs have only one mode - I wouldn't be surprised if an ISA
invents for example v4si -> di reduction? So do we want to make
reduc_plus_scal_optab a little bit more future proof (maybe there
is already an ISA that supports this kind of reduction?).
That sounds like a plausible thing for an ISA to do, indeed. However given
these names are only used by the autovectorizer rather than directly, the
question is what the corresponding source code looks like, and/or what
changes to the autovectorizer we might have to make to (look for code to)
exploit such an instruction.
Ah, indeed. Would be sth like a REDUC_WIDEN_SUM_EXPR or so.
Post by Alan Lawrence
At this point I could go for a
reduc_{plus,min_max}_scal_<mode><mode> which reduces from the first vector
mode to the second scalar mode, and then make the vectorizer look only for
cases where the second mode was the element type of the first; but I'm not
sure I want to do anything more complicated than that at this stage.
(However, indeed it would leave the possibility open for the future.)
Yeah, agreed. For the min/max case a widen variant isn't useful anyway.
Thanks,
Richard.
Post by Alan Lawrence
--Alan
Alan Lawrence
2014-09-25 16:12:24 UTC
Well, even with that C source, you'd need to be careful and ensure that the
vectorized loop never went round more than once, or else the additions within
the loop would be performed in 8 bits, different from the final reduction...

So: original patch with updated commenting attached...Segher, is there any
chance you could test this on powerpc too? (in combination with patch 2/14,
which will need to be applied first; you can skip patch 1, and >=4.)

--Alan
Post by Richard Biener
Post by Alan Lawrence
Ok, so, I've tried making reduc_plus optab take two modes: that of the
vector to reduce, and the result; thus allowing platforms to provide a
widening reduction. However, I'm keeping reduc_[us](min|max)_optab with only
a single mode, as widening makes no sense there.
I've not gone as far as making the vectorizer use any such a widening
reduction, however: as previously stated, I'm not really sure what the input
source code for that even looks like (maybe in a language other than C?). If
we wanted to do a non-widening reduction using such an instruction (by
discarding the extra bits), strikes me the platform can/should provide a
non-widening optab for that case...
I expect it to apply to sth like
int foo (char *in, int n)
{
int res = 0;
for (int i = 0; i < n; ++i)
res += *in;
return res;
}
where you'd see
temc = *in;
tem = (int)temc;
res += tem;
we probably handle this by widening the chars to ints and unrolling
the loop enough to make that work (thus for n == 16 it would maybe
fail to vectorize?). It should be more efficient to pattern-detect
this as widening reduction.
Post by Alan Lawrence
Testing: bootstrapped on x86_64 linux + check-gcc; cross-tested
aarch64-none-elf check-gcc; cross-tested aarch64_be-none-elf aarch64.exp +
vect.exp.
So, my feeling is that the extra complexity here doesn't really buy us
anything; and that if we do want to support / use widening reductions in the
future, we should do so with a separate, reduc_plus_widen... optab, and
stick with the original patch/formulation for now. (In other words: this
patch is a guide to how I think a dual-mode reduc_plus_optab looks, but I
don't honestly like it!).
If you agree, I shall transplant the comments on scalar_reduc_to_vector from
this patch into the original, and then post that revised version?
I agree. We can come back once a target implements such widening
reduction.
Richard.
Post by Alan Lawrence
Cheers, Alan
Post by Richard Biener
Post by Alan Lawrence
Post by Richard Biener
scalar_reduc_to_vector misses a comment.
Ok to reuse the comment in optabs.h in optabs.c also?
Sure.
Post by Alan Lawrence
Post by Richard Biener
I wonder if at the end we wouldn't transition all backends and then
renaming reduc_*_scal_optab back to reduc_*_optab makes sense.
Yes, that sounds like a plan, the _scal is a bit of a mouthful.
Post by Richard Biener
The optabs have only one mode - I wouldn't be surprised if an ISA
invents for example v4si -> di reduction? So do we want to make
reduc_plus_scal_optab a little bit more future proof (maybe there
is already an ISA that supports this kind of reduction?).
That sounds like a plausible thing for an ISA to do, indeed. However given
these names are only used by the autovectorizer rather than directly, the
question is what the corresponding source code looks like, and/or what
changes to the autovectorizer we might have to make to (look for code to)
exploit such an instruction.
Ah, indeed. Would be sth like a REDUC_WIDEN_SUM_EXPR or so.
Post by Alan Lawrence
At this point I could go for a
reduc_{plus,min_max}_scal_<mode><mode> which reduces from the first vector
mode to the second scalar mode, and then make the vectorizer look only for
cases where the second mode was the element type of the first; but I'm not
sure I want to do anything more complicated than that at this stage.
(However, indeed it would leave the possibility open for the future.)
Yeah, agreed. For the min/max case a widen variant isn't useful anyway.
Thanks,
Richard.
Post by Alan Lawrence
--Alan
Segher Boessenkool
2014-09-25 19:19:53 UTC
Post by Alan Lawrence
So: original patch with updated commenting attached...Segher, is there any
chance you could test this on powerpc too? (in combination with patch 2/14,
which will need to be applied first; you can skip patch 1, and >=4.)
2+3/14, tested as before, on powerpc64-linux; no regressions.

Cheers,


Segher
Alan Lawrence
2014-09-18 11:59:35 UTC
This migrates AArch64 over to the new optab for 'plus' reductions, i.e. the
define_expands now produce scalars by generating a MOV to a GPR. Effectively,
this moves the vget_lane from inside every arm_neon.h intrinsic into the
define_expand.

Tested: aarch64.exp vect.exp on aarch64-none-elf and aarch64_be-none-elf (full
check-gcc on next patch for reduc_min/max)

gcc/ChangeLog:

* config/aarch64/aarch64-simd-builtins.def
(reduc_splus_<mode>/VDQF, reduc_uplus_<mode>/VDQF, reduc_splus_v4sf):
Remove.
(reduc_plus_scal_<mode>, reduc_plus_scal_v4sf): New.

* config/aarch64/aarch64-simd.md (reduc_<sur>plus_mode): Remove.
(reduc_splus_<mode>, reduc_uplus_<mode>, reduc_plus_scal_<mode>): New.

(reduc_<sur>plus_mode): Change SUADDV -> UNSPEC_ADDV, rename to...
(aarch64_reduc_plus_internal<mode>): ...this.

(reduc_<sur>plus_v2si): Change SUADDV -> UNSPEC_ADDV, rename to...
(aarch64_reduc_plus_internalv2si): ...this.

(reduc_splus_<mode>/V2F): Rename to...
(aarch64_reduc_plus_internal<mode>): ...this.

* config/aarch64/iterators.md
(UNSPEC_SADDV, UNSPEC_UADDV, SUADDV): Remove.
(UNSPEC_ADDV): New.
(sur): Remove elements for UNSPEC_SADDV and UNSPEC_UADDV.

* config/aarch64/arm_neon.h (vaddv_s8, vaddv_s16, vaddv_s32, vaddv_u8,
vaddv_u16, vaddv_u32, vaddvq_s8, vaddvq_s16, vaddvq_s32, vaddvq_s64,
vaddvq_u8, vaddvq_u16, vaddvq_u32, vaddvq_u64, vaddv_f32, vaddvq_f32,
vaddvq_f64): Change __builtin_aarch64_reduc_[us]plus_... to
__builtin_aarch64_reduc_plus_scal, remove vget_lane wrapper.
Marcus Shawcroft
2014-09-24 09:44:51 UTC
Post by Alan Lawrence
This migrates AArch64 over to the new optab for 'plus' reductions, i.e. so
the define_expands produce scalars by generating a MOV to a GPR.
Effectively, this moves the vget_lane from inside every arm_neon.h intrinsic
into the define_expand.
Tested: aarch64.exp vect.exp on aarch64-none-elf and aarch64_be-none-elf
(full check-gcc on next patch for reduc_min/max)
+(define_expand "reduc_splus_<mode>"
+

Can't we just drop the define_expands for the old optabs altogether?

/Marcus
Alan Lawrence
2014-09-18 12:02:10 UTC
Similarly to the previous patch (r/2205), this migrates AArch64 to the new
reduce-to-scalar optabs for min and max. For consistency we apply the same
treatment to the smax_nan and smin_nan patterns (used for __builtins), even
though reduc_smin_nan_scal (etc.) is not a standard name.

Tested: check-gcc on aarch64-none-elf and aarch64_be-none-elf.

gcc/ChangeLog:

* config/aarch64/aarch64-simd-builtins.def (reduc_smax_, reduc_smin_,
reduc_umax_, reduc_umin_, reduc_smax_nan_, reduc_smin_nan_): Remove.
(reduc_smax_scal_, reduc_smin_scal_, reduc_umax_scal_,
reduc_umin_scal_, reduc_smax_nan_scal_, reduc_smin_nan_scal_): New.

* config/aarch64/aarch64-simd.md
(reduc_<maxmin_uns>_<mode>): Rename VDQV_S variant to...
(reduc_<maxmin_uns>_internal<mode>): ...this.
(reduc_<maxmin_uns>_<mode>): New (VDQ_BHSI).
(reduc_<maxmin_uns>_scal_<mode>): New (*2).

(reduc_<maxmin_uns>_v2si): Combine with below, renaming...
(reduc_<maxmin_uns>_<mode>): Combine V2F with above, renaming...
(reduc_<maxmin_uns>_internal_<mode>): ...to this (VDQF).

* config/aarch64/arm_neon.h (vmaxv_f32, vmaxv_s8, vmaxv_s16,
vmaxv_s32, vmaxv_u8, vmaxv_u16, vmaxv_u32, vmaxvq_f32, vmaxvq_f64,
vmaxvq_s8, vmaxvq_s16, vmaxvq_s32, vmaxvq_u8, vmaxvq_u16, vmaxvq_u32,
vmaxnmv_f32, vmaxnmvq_f32, vmaxnmvq_f64, vminv_f32, vminv_s8,
vminv_s16, vminv_s32, vminv_u8, vminv_u16, vminv_u32, vminvq_f32,
vminvq_f64, vminvq_s8, vminvq_s16, vminvq_s32, vminvq_u8, vminvq_u16,
vminvq_u32, vminnmv_f32, vminnmvq_f32, vminnmvq_f64): Update to use
__builtin_aarch64_reduc_..._scal; remove vget_lane wrapper.
Marcus Shawcroft
2014-09-24 09:47:11 UTC
Post by Alan Lawrence
Similarly to the previous patch (r/2205), this migrates AArch64 to the new
reduce-to-scalar optabs for min and max. For consistency we apply the same
treatment to the smax_nan and smin_nan patterns (used for __builtins), even
though reduc_smin_nan_scal (etc.) is not a standard name.
Tested: check-gcc on aarch64-none-elf and aarch64_be-none-elf.
* config/aarch64/aarch64-simd-builtins.def (reduc_smax_, reduc_smin_,
reduc_umax_, reduc_umin_, reduc_smax_nan_, reduc_smin_nan_): Remove.
(reduc_smax_scal_, reduc_smin_scal_, reduc_umax_scal_,
reduc_umin_scal_, reduc_smax_nan_scal_, reduc_smin_nan_scal_): New.
* config/aarch64/aarch64-simd.md
(reduc_<maxmin_uns>_<mode>): Rename VDQV_S variant to...
(reduc_<maxmin_uns>_internal<mode>): ...this.
(reduc_<maxmin_uns>_<mode>): New (VDQ_BHSI).
(reduc_<maxmin_uns>_scal_<mode>): New (*2).
(reduc_<maxmin_uns>_v2si): Combine with below, renaming...
(reduc_<maxmin_uns>_<mode>): Combine V2F with above, renaming...
(reduc_<maxmin_uns>_internal_<mode>): ...to this (VDQF).
* config/aarch64/arm_neon.h (vmaxv_f32, vmaxv_s8, vmaxv_s16,
vmaxv_s32, vmaxv_u8, vmaxv_u16, vmaxv_u32, vmaxvq_f32, vmaxvq_f64,
vmaxvq_s8, vmaxvq_s16, vmaxvq_s32, vmaxvq_u8, vmaxvq_u16, vmaxvq_u32,
vmaxnmv_f32, vmaxnmvq_f32, vmaxnmvq_f64, vminv_f32, vminv_s8,
vminv_s16, vminv_s32, vminv_u8, vminv_u16, vminv_u32, vminvq_f32,
vminvq_f64, vminvq_s8, vminvq_s16, vminvq_s32, vminvq_u8, vminvq_u16,
vminvq_u32, vminnmv_f32, vminnmvq_f32, vminnmvq_f64): Update to use
__builtin_aarch64_reduc_..._scal; remove vget_lane wrapper.
If we don't need the old optabs, I think it would be better to drop those
define_expands; otherwise OK.
/Marcus
Alan Lawrence
2014-09-18 12:05:10 UTC
This gives us back the constant-folding of the neon-intrinsics that was removed
in the first patch, but is now OK for bigendian too.

bootstrapped on aarch64-none-linux-gnu.
check-gcc on aarch64-none-elf and aarch64_be-none-elf.

gcc/ChangeLog:

* config/aarch64/aarch64.c (TARGET_GIMPLE_FOLD_BUILTIN): Define again.
* config/aarch64/aarch64-builtins.c (aarch64_gimple_fold_builtin):
Restore, enable for bigendian, update to use __builtin..._scal...
Marcus Shawcroft
2014-09-24 09:48:09 UTC
Post by Alan Lawrence
This gives us back the constant-folding of the neon-intrinsics that was
removed in the first patch, but is now OK for bigendian too.
bootstrapped on aarch64-none-linux-gnu.
check-gcc on aarch64-none-elf and aarch64_be-none-elf.
* config/aarch64/aarch64.c (TARGET_GIMPLE_FOLD_BUILTIN): Define again.
Restore, enable for bigendian, update to use __builtin..._scal...
OK /Marcus
Alan Lawrence
2014-09-18 12:19:03 UTC
For reduction operations (e.g. multiply) that don't have such a tree code, or
where the target platform doesn't define an optab handler for the tree code, we
can perform the reduction using a series of log(N) shifts (where N = #elements
in the vector), using the VEC_RSHIFT_EXPR (whole-vector-shift) tree code, if the
platform handles the vec_shr_optab.

First stage is to add some tests of non-(min/max/plus) reductions; here,
multiplies. The first is designed to be non-foldable, so we make sure the
architectural instructions line up with what the tree codes specify. The second
is designed to be easily constant-propagated, to test the (currently
endianness-dependent) constant folding code.

In lib/target-supports.exp, I've defined a new
check_effective_target_whole_vector_shift, which I intended to define to true
for platforms with the vec_shr optab. However, I've not managed to make this
test pass on PowerPC - even with -maltivec, -fdump-tree-vect-details gives me a
message about the target not supporting vector multiplication - so I've omitted
PowerPC from the whole_vector_shift. This doesn't feel right, suggestions
welcomed from PowerPC maintainers?

Tests passing on arm-none-eabi and x86_64-none-linux-gnu;
also verified the scan-tree-dump part works on ia64-none-linux-gnu (by compiling
to assembly only).
(Tests are not run on AArch64, because we have no vec_shr_optab at this point;
PowerPC, as above; or MIPS, as check_effective_target_vect_int_mult yields 0.)

gcc/testsuite/ChangeLog:

* lib/target-supports.exp (check_effective_target_whole_vector_shift):
New.

* gcc.dg/vect/vect-reduc-mul_1.c: New test.
* gcc.dg/vect/vect-reduc-mul_2.c: New test.
Richard Biener
2014-09-22 10:41:44 UTC
Post by Alan Lawrence
For reduction operations (e.g. multiply) that don't have such a tree code,
or where the target platform doesn't define an optab handler for the tree
code, we can perform the reduction using a series of log(N) shifts (where N
= #elements in vector), using the VEC_RSHIFT_EXPR=whole-vector-shift tree
code (if the platform handles the vec_shr_optab).
First stage is to add some tests of non-(min/max/plus) reductions; here,
multiplies. The first is designed to be non-foldable, so we make sure the
architectural instructions line up with what the tree codes specify. The
second is designed to be easily constant-propagated, to test the (currently
endianness-dependent) constant folding code.
In lib/target-supports.exp, I've defined a new
check_effective_target_whole_vector_shift, which I intended to define to
true for platforms with the vec_shr optab. However, I've not managed to make
this test pass on PowerPC - even with -maltivec, -fdump-tree-vect-details
gives me a message about the target not supporting vector multiplication -
so I've omitted PowerPC from the whole_vector_shift. This doesn't feel
right, suggestions welcomed from PowerPC maintainers?
Tests passing on arm-none-eabi and x86_64-none-linux-gnu;
also verified the scan-tree-dump part works on ia64-none-linux-gnu (by
compiling to assembly only).
(Tests are not run on AArch64, because we have no vec_shr_optab at this
point; PowerPC, as above; or MIPS, as check_effective_target_vect_int_mult
yields 0.)
Ok.

Thanks,
Richard.
Post by Alan Lawrence
* lib/target-supports.exp
New.
* gcc.dg/vect/vect-reduc-mul_1.c: New test.
* gcc.dg/vect/vect-reduc-mul_2.c: New test.
Alan Lawrence
2014-09-18 12:25:07 UTC
These are like the previous patch, but using | rather than * - I was unable to
get the previous test to pass on PowerPC and MIPS.

I note there is no inherent vector operation here - a bitwise OR across a word,
and a "reduction via shifts" using scalar (not vector) ops would be all that's
necessary. However, GCC doesn't exploit this possibility at present, and I don't
have any plans at present to add such myself.

Passing on x86_64-linux-gnu, aarch64-none-elf, aarch64_be-none-elf, arm-none-eabi.
The 'scan-tree-dump' part passes on mips64 and powerpc (although the latter is
disabled as check_effective_target_whole_vector_shift gives 0, as per previous
patch)

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-reduc-or_1.c: New test.
* gcc.dg/vect/vect-reduc-or_2.c: Likewise.
Richard Biener
2014-09-22 10:42:08 UTC
Post by Alan Lawrence
These are like the previous patch, but using | rather than * - I was unable
to get the previous test to pass on PowerPC and MIPS.
I note there is no inherent vector operation here - a bitwise OR across a
word, and a "reduction via shifts" using scalar (not vector) ops would be
all that's necessary. However, GCC doesn't exploit this possibility at
present, and I don't have any plans at present to add such myself.
Passing on x86_64-linux-gnu, aarch64-none-elf, aarch64_be-none-elf, arm-none-eabi.
The 'scan-tree-dump' part passes on mips64 and powerpc (although the latter
is disabled as check_effective_target_whole_vector_shift gives 0, as per
previous patch)
Ok.

Thanks,
Richard.
Post by Alan Lawrence
* gcc.dg/vect/vect-reduc-or_1.c: New test.
* gcc.dg/vect/vect-reduc-or_2.c: Likewise.
Alan Lawrence
2014-09-18 12:27:51 UTC
The VEC_RSHIFT_EXPR is only ever used by the vectorizer in tree-vect-loop.c
(vect_create_epilog_for_reduction), to shift the vector by a whole number of
elements. The tree code allows more general shifts, but only for integral types.
That extra generality only causes pain and difficulty for backends (particularly
backends with different endiannesses), and enforcing the whole-number-of-elements
restriction for integral types too does no harm.

bootstrapped on aarch64-none-linux-gnu and x86-64-none-linux-gnu
check-gcc on aarch64-none-elf and x86_64-none-linux-gnu

gcc/ChangeLog:

* tree-cfg.c (verify_gimple_assign_binary): For VEC_RSHIFT_EXPR (and
VEC_LSHIFT_EXPR), require shifts to be by a whole number of elements
for all types, rather than only non-integral types.

* tree.def (VEC_LSHIFT_EXPR, VEC_RSHIFT_EXPR): Update comment.

* doc/md.texi (vec_shl_m, vec_shr_m): Update comment.
Richard Biener
2014-09-22 10:50:43 UTC
Post by Alan Lawrence
The VEC_RSHIFT_EXPR is only ever used by the vectorizer in tree-vect-loop.c
(vect_create_epilog_for_reduction), to shift the vector by a whole number of
elements. The tree code allows more general shifts but only for integral
types. This only causes pain and difficulty for backends (particularly for
backends with different endiannesses), and enforcing that restriction for
integral types too does no harm.
bootstrapped on aarch64-none-linux-gnu and x86-64-none-linux-gnu
check-gcc on aarch64-none-elf and x86_64-none-linux-gnu
Hmm, but then (coming from the tree / gimple level) all shifts can
be expressed with a VEC_PERM_EXPR. And of course a general
whole-vector shift could be expressed using a VIEW_CONVERT_EXPR
to a 1-element integer vector and a regular [RL]SHIFT_EXPR and then
converting back.

So it seems to me that the vectorizer should instead emit a
VEC_PERM_EXPR (making sure the backends or the generic
vec_perm expansion code in optabs.c handles the whole-vector-shift
case in an optimal way).

The current VEC_RSHIFT_EXPR description lacks information
on what is shifted in btw (always zeros? the most significant bit (endian
dependent?!)).

So - can we instead remove VEC_[LR]SHIFT_EXPR? Seems that
VEC_LSHIFT_EXPR is unused anyway, and thus vec_shl_optabs
as well.

Thanks,
Richard.
Post by Alan Lawrence
* tree-cfg.c (verify_gimple_assign_binary): for VEC_RSHIFT_EXPR (and
VEC_LSHIFT_EXPR), require shifts to be by a whole number of elements
for all types, rather than only non-integral types.
* tree.def (VEC_LSHIFT_EXPR, VEC_RSHIFT_EXPR): Update comment.
* doc/md.texi (vec_shl_m, vec_shr_m): Update comment.
Alan Lawrence
2014-09-18 12:34:24 UTC
This allows reduction of non-(plus|min|max) operations using log_2(N) shifts
rather than N vec_extracts; e.g. for the example code

int
main (unsigned char argc, char **argv)
{
unsigned char in[16] = { 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31 };
unsigned char i = 0;
unsigned char sum = 1;

/* Prevent constant propagation of the entire loop below. */
asm volatile ("" : : : "memory");

for (i = 0; i < 16; i++)
sum *= in[i];

if (sum != 33)
__builtin_printf("Failed %d\n", sum);
}

(a simplified, less-general version of vect-reduc-mul_1.c) this gives

main:
ldr q0, .LC0
sub sp, sp, #16
str q0, [sp]
ldr q1, [sp]
movi v0.4s, 0
ext v2.16b, v1.16b, v0.16b, #8
mul v1.16b, v1.16b, v2.16b
ext v2.16b, v1.16b, v0.16b, #4
mul v1.16b, v2.16b, v1.16b
ext v2.16b, v1.16b, v0.16b, #2
mul v1.16b, v2.16b, v1.16b
ext v0.16b, v1.16b, v0.16b, #1
mul v0.16b, v0.16b, v1.16b
umov w1, v0.b[0]
cmp w1, 33
beq .L2
...

rather than previously:

main:
ldr q0, .LC0
sub sp, sp, #16
str q0, [sp]
ldr d1, [sp]
ldr d0, [sp, 8]
mul v0.8b, v0.8b, v1.8b
umov w0, v0.b[1]
umov w3, v0.b[0]
umov w2, v0.b[2]
umov w7, v0.b[3]
umov w6, v0.b[4]
mul w3, w0, w3
umov w5, v0.b[5]
umov w4, v0.b[6]
umov w1, v0.b[7]
mul w3, w3, w2
mul w2, w3, w7
mul w2, w2, w6
mul w0, w2, w5
mul w0, w0, w4
mul w1, w0, w1
uxtb w1, w1
cmp w1, 33
beq .L2
...


Tested check-gcc on aarch64-none-elf and aarch64_be-none-elf. (Including new
tests from previous patches.)

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (vec_shr_<mode>): New (*2).

gcc/testsuite/ChangeLog:
* lib/target-supports.exp (check_effective_target_whole_vector_shift):
Add aarch64*-*-*.
Alan Lawrence
2014-09-18 12:35:36 UTC
The VEC_LSHIFT_EXPR tree code, and the corresponding vec_shl_optab, seem to have
been added for completeness, providing a counterpart to VEC_RSHIFT_EXPR and
vec_shr_optab. However, whereas VEC_RSHIFT_EXPRs are generated (only) by the
vectorizer, VEC_LSHIFT_EXPR expressions are not generated at all, so there seems
little point in maintaining it.

Bootstrapped on x86_64-unknown-linux-gnu.
aarch64.exp+vect.exp on aarch64-none-elf and aarch64_be-none-elf.

gcc/ChangeLog:

* expr.c (expand_expr_real_2): Remove code handling VEC_LSHIFT_EXPR.
* fold-const.c (const_binop): Likewise.
* cfgexpand.c (expand_debug_expr): Likewise.
* tree-inline.c (estimate_operator_cost, dump_generic_node,
op_code_prio, op_symbol_code): Likewise.
* tree-vect-generic.c (expand_vector_operations_1): Likewise.
* optabs.c (optab_for_tree_code): Likewise.
(expand_vec_shift_expr): Likewise, update comment.
* tree.def: Delete VEC_LSHIFT_EXPR, remove comment.
* optabs.h (expand_vec_shift_expr): Remove comment re. VEC_LSHIFT_EXPR.
* optabs.def: Remove vec_shl_optab.
* doc/md.texi: Remove references to vec_shl_m.
Richard Biener
2014-09-22 10:52:03 UTC
Post by Alan Lawrence
The VEC_LSHIFT_EXPR tree code, and the corresponding vec_shl_optab, seem to
have been added for completeness, providing a counterpart to VEC_RSHIFT_EXPR
and vec_shr_optab. However, whereas VEC_RSHIFT_EXPRs are generated (only) by
the vectorizer, VEC_LSHIFT_EXPR expressions are not generated at all, so
there seems little point in maintaining it.
Bootstrapped on x86_64-unknown-linux-gnu.
aarch64.exp+vect.exp on aarch64-none-elf and aarch64_be-none-elf.
Ah, there it is ;)

Ok.

Thanks,
Richard.
Post by Alan Lawrence
* expr.c (expand_expr_real_2): Remove code handling VEC_LSHIFT_EXPR.
* fold-const.c (const_binop): Likewise.
* cfgexpand.c (expand_debug_expr): Likewise.
* tree-inline.c (estimate_operator_cost, dump_generic_node,
op_code_prio, op_symbol_code): Likewise.
* tree-vect-generic.c (expand_vector_operations_1): Likewise.
* optabs.c (optab_for_tree_code): Likewise.
(expand_vec_shift_expr): Likewise, update comment.
* tree.def: Delete VEC_LSHIFT_EXPR, remove comment.
* optabs.h (expand_vec_shift_expr): Remove comment re.
VEC_LSHIFT_EXPR.
* optabs.def: Remove vec_shl_optab.
* doc/md.texi: Remove references to vec_shl_m.
Alan Lawrence
2014-09-18 12:42:59 UTC
The direction of VEC_RSHIFT_EXPR has been endian-dependent, contrary to the
general principles of tree. This patch updates fold-const and the vectorizer
(the only place where such expressions are created), such that VEC_RSHIFT_EXPR
always shifts towards element 0.

The tree code still maps directly onto the vec_shr_optab, and so this patch
*will break any bigendian platform defining the vec_shr optab*.
--> For AArch64_be, patch follows next in series;
--> For PowerPC, I think patch/rfc 15 should fix, please inspect;
--> For MIPS, I think patch/rfc 16 should fix, please inspect.

gcc/ChangeLog:

* fold-const.c (const_binop): VEC_RSHIFT_EXPR always shifts towards
element 0.

* tree-vect-loop.c (vect_create_epilog_for_reduction): Always extract
the result of a reduction with vector shifts from element 0.

* tree.def (VEC_RSHIFT_EXPR, VEC_LSHIFT_EXPR): Comment shift direction.

* doc/md.texi (vec_shr_m, vec_shl_m): Document shift direction.

Testing Done:

Bootstrap and check-gcc on x86_64-none-linux-gnu; check-gcc on aarch64-none-elf.
David Edelsohn
2014-09-18 13:12:40 UTC
Post by Alan Lawrence
The direction of VEC_RSHIFT_EXPR has been endian-dependent, contrary to the
general principles of tree. This patch updates fold-const and the vectorizer
(the only place where such expressions are created), such that
VEC_RSHIFT_EXPR always shifts towards element 0.
The tree code still maps directly onto the vec_shr_optab, and so this patch
*will break any bigendian platform defining the vec_shr optab*.
--> For AArch64_be, patch follows next in series;
--> For PowerPC, I think patch/rfc 15 should fix, please inspect;
--> For MIPS, I think patch/rfc 16 should fix, please inspect.
* fold-const.c (const_binop): VEC_RSHIFT_EXPR always shifts towards
element 0.
* tree-vect-loop.c (vect_create_epilog_for_reduction): always extract
the result of a reduction with vector shifts from element 0.
* tree.def (VEC_RSHIFT_EXPR, VEC_LSHIFT_EXPR): Comment shift direction.
* doc/md.texi (vec_shr_m, vec_shl_m): Document shift direction.
Bootstrap and check-gcc on x86_64-none-linux-gnu; check-gcc on
aarch64-none-elf.
Why wasn't this tested on the PowerLinux system in the GCC Compile Farm?

Also, Bill Schmidt can help check the PPC parts of the patches.

Thanks, David
Bill Schmidt
2014-09-22 13:27:04 UTC
Post by David Edelsohn
Post by Alan Lawrence
The direction of VEC_RSHIFT_EXPR has been endian-dependent, contrary to the
general principles of tree. This patch updates fold-const and the vectorizer
(the only place where such expressions are created), such that
VEC_RSHIFT_EXPR always shifts towards element 0.
The tree code still maps directly onto the vec_shr_optab, and so this patch
*will break any bigendian platform defining the vec_shr optab*.
--> For AArch64_be, patch follows next in series;
--> For PowerPC, I think patch/rfc 15 should fix, please inspect;
--> For MIPS, I think patch/rfc 16 should fix, please inspect.
* fold-const.c (const_binop): VEC_RSHIFT_EXPR always shifts towards
element 0.
* tree-vect-loop.c (vect_create_epilog_for_reduction): always extract
the result of a reduction with vector shifts from element 0.
* tree.def (VEC_RSHIFT_EXPR, VEC_LSHIFT_EXPR): Comment shift direction.
* doc/md.texi (vec_shr_m, vec_shl_m): Document shift direction.
Bootstrap and check-gcc on x86_64-none-linux-gnu; check-gcc on aarch64-none-elf.
Why wasn't this tested on the PowerLinux system in the GCC Compile Farm?
Also, Bill Schmidt can help check the PPC parts of the patches.
Sorry for the late response; I just returned from vacation. I think
that patch 15 looks reasonable on the surface, but would be more
comfortable if it had been tested. I would echo David's suggestion that
you please test this on gcc110 in the compile farm to avoid surprises.
Given the similarity between vec_shl_<mode> and vec_shr_<mode> I am ok
with removing the former; it won't be difficult to re-create it later if
needed.

Please add some of the language you used above about VEC_RSHIFT_EXPR as
commentary for vec_shr_<mode> in vector.md, as right-shifting towards
element zero is not an obvious concept on a BE machine.

Thanks,
Bill
Post by David Edelsohn
Thanks, David
Richard Biener
2014-09-22 10:58:05 UTC
Post by Alan Lawrence
The direction of VEC_RSHIFT_EXPR has been endian-dependent, contrary to the
general principles of tree. This patch updates fold-const and the vectorizer
(the only place where such expressions are created), such that
VEC_RSHIFT_EXPR always shifts towards element 0.
The tree code still maps directly onto the vec_shr_optab, and so this patch
*will break any bigendian platform defining the vec_shr optab*.
--> For AArch64_be, patch follows next in series;
--> For PowerPC, I think patch/rfc 15 should fix, please inspect;
--> For MIPS, I think patch/rfc 16 should fix, please inspect.
* fold-const.c (const_binop): VEC_RSHIFT_EXPR always shifts towards
element 0.
* tree-vect-loop.c (vect_create_epilog_for_reduction): always extract
the result of a reduction with vector shifts from element 0.
* tree.def (VEC_RSHIFT_EXPR, VEC_LSHIFT_EXPR): Comment shift direction.
* doc/md.texi (vec_shr_m, vec_shl_m): Document shift direction.
Bootstrap and check-gcc on x86_64-none-linux-gnu; check-gcc on
aarch64-none-elf.
As said elsewhere I'd like the vectorizer to use VEC_PERM_EXPRs
and the generic vec_perm expansion machinery handle the
case where the permute can be expressed using the vec_shr_optab.
You'd have, for a 1-element shift of V4SI x, VEC_PERM <x, { 0, 0, 0, 0
}, {4, 3, 2, 1 }>

I'd say that if the target says it can handle the constant permute just fine
then use the vec_perm_const expansion path.

Richard.
Alan Lawrence
2014-09-18 12:45:24 UTC
The previous patch broke aarch64_be by redefining VEC_RSHIFT_EXPR /
vec_shr_optab to always shift the vector towards gcc's element 0. This fixes
aarch64_be to do that.

check-gcc on aarch64-none-elf (no changes) and aarch64_be-none-elf (fixes all
regressions produced by previous patch, i.e. no regressions from before
redefining vec_shr).


gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (vec_shr_<mode> *2): Fix bigendian.
Richard Biener
2014-09-22 10:52:46 UTC
Post by Alan Lawrence
The previous patch broke aarch64_be by redefining VEC_RSHIFT_EXPR /
vec_shr_optab to always shift the vector towards gcc's element 0. This fixes
aarch64_be to do that.
check-gcc on aarch64-none-elf (no changes) and aarch64_be-none-elf (fixes
all regressions produced by previous patch, i.e. no regressions from before
redefining vec_shr).
Using vector permutes would have avoided this I guess?

Richard.
Post by Alan Lawrence
* config/aarch64/aarch64-simd.md (vec_shr_<mode> *2): Fix bigendian.
Alan Lawrence
2014-09-18 12:48:08 UTC
Following earlier patches, vect_create_epilog_for_reduction contains exactly one
case where extract_scalar_result==true. Hence, move the code 'if
(extract_scalar_result)' there, and tidy-up/remove some variables.

bootstrapped on x86_64-none-linux-gnu + check-gcc + check-g++.

gcc/ChangeLog:

* tree-vect-loop.c (vect_create_epilog_for_reduction): Move code for
'if (extract_scalar_result)' to the only place that it is true.
Richard Biener
2014-09-22 10:53:42 UTC
Post by Alan Lawrence
Following earlier patches, vect_create_epilog_for_reduction contains exactly
one case where extract_scalar_result==true. Hence, move the code 'if
(extract_scalar_result)' there, and tidy-up/remove some variables.
bootstrapped on x86_64-none-linux-gnu + check-gcc + check-g++.
Ok.

Thanks,
Richard.
Post by Alan Lawrence
* tree-vect-loop.c (vect_create_epilog_for_reduction): Move code for
'if (extract_scalar_result)' to the only place that it is true.
Alan Lawrence
2014-09-18 12:57:54 UTC
Patch 12 of 14 (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01475.html) will
break bigendian targets implementing vec_shr. This is a PowerPC parallel of
patch 13 of 14 (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01477.html) for
AArch64. I've checked I can build a stage 1 compiler for powerpc-none-eabi and
that the assembly output looks plausible but no further than that.

In fact I find BYTES_BIG_ENDIAN is defined to true on powerpcle-none-eabi as
well as powerpc-none-eabi (and also on ppc64-none-elf, but to false on
ppc64le-none-elf), so I'm not quite sure how your backend works in this regard -
nonetheless I hope this is a helpful starting point even if not definitive.

gcc/ChangeLog:

* config/rs6000/vector.md (vec_shl_<mode>): Remove.
(vec_shr_<mode>): Reverse shift if BYTES_BIG_ENDIAN.
David Edelsohn
2014-09-23 12:50:10 UTC
Post by Alan Lawrence
Patch 12 of 14 (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01475.html)
will break bigendian targets implementing vec_shr. This is a PowerPC
parallel of patch 13 of 14
(https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01477.html) for AArch64. I've
checked I can build a stage 1 compiler for powerpc-none-eabi and that the
assembly output looks plausible but no further than that.
In fact I find BYTES_BIG_ENDIAN is defined to true on powerpcle-none-eabi as
well as powerpc-none-eabi (and also on ppc64-none-elf, but to false on
ppc64le-none-elf), so I'm not quite sure how your backend works in this
regard - nonetheless I hope this is a helpful starting point even if not
definitive.
* config/rs6000/vector.md (vec_shl_<mode>): Remove.
(vec_shr_<mode>): Reverse shift if BYTES_BIG_ENDIAN.
This patch is okay if no regressions on a PowerLinux system (either
you or Segher can test on the GCC Compile Farm).

Thanks, David
Alan Lawrence
2014-09-18 13:02:15 UTC
Patch 12 of 14 (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01475.html) will
break bigendian targets implementing vec_shr. This is a MIPS parallel of
patch 13 of 14 (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01477.html) for
AArch64; the idea is that vec_shr should be unaffected on little-endian, but
reversed (to be the same as the old vec_shl) if big-endian.

Manual inspection of assembler output looks to do the right sort of thing on
mips and mips64, but I haven't been able to run any testcases so this is not
definitive. I'm hoping it is nonetheless helpful as a starting point!

gcc/ChangeLog:

* config/mips/loongson.md (unspec): Remove UNSPEC_LOONGSON_DSLL.
(vec_shl_<mode>): Remove.
(vec_shr_<mode>): Reverse shift if BYTES_BIG_ENDIAN.
Richard Biener
2014-09-22 11:21:01 UTC
Post by Alan Lawrence
The end goal here is to remove this code from tree-vect-loop.c
if (BYTES_BIG_ENDIAN)
bitpos = size_binop (MULT_EXPR,
bitsize_int (TYPE_VECTOR_SUBPARTS (vectype) - 1),
TYPE_SIZE (scalar_type));
else
as this is the root cause of PR/61114 (see testcase there, failing on all
bigendian targets supporting reduc_[us]plus_optab). Quoting Richard Biener,
"all code conditional on BYTES/WORDS_BIG_ENDIAN in tree-vect* is suspicious".
(Path 1) (patches 1-6) Reductions using REDUC_(PLUS|MIN|MAX)_EXPR =
reduc_[us](plus|min|max)_optab.
The optab is documented as "the scalar result is stored in the least
significant bits of operand 0", but the tree code as "the first element in
the vector holding the result of the reduction of all elements of the
operand". This mismatch means that when the tree code is folded, the code
snippet above reads the result from the wrong end of the vector.
The strategy (as per
https://gcc.gnu.org/ml/gcc-patches/2014-08/msg00041.html) is to define new
tree codes and optabs that produce scalar results directly; this seems
better than tying (the element of the vector into which the result is
placed) to (the endianness of the target), and avoids generating extra moves
on current bigendian targets. However, the previous optabs are retained for
now as a migration strategy so as not to break existing backends; moving
individual platforms over will follow.
A complication here is on AArch64, where we directly generate
REDUC_PLUS_EXPRs from intrinsics in gimple_fold_builtin; I temporarily
remove this folding in order to decouple the midend and AArch64 backend.
Sounds fine. I hope we can transition all backends for 5.0 and remove
the vector variant optabs (maybe renaming the scalar ones).
Post by Alan Lawrence
(Path 2) (patches 7-13) Reductions using whole-vector-shifts, i.e.
VEC_RSHIFT_EXPR and vec_shr_optab. Here the tree code as well as the optab
is defined in an endianness-dependent way, leading to significant
complication in fold-const.c. (Moreover, the "equivalent" vec_shl_optab is
never used!). Few platforms appear to handle vec_shr_optab (and fewer
bigendian - I see only PowerPC and MIPS), so it seems pertinent to change
the existing optab to be endianness-neutral.
Patch 10 defines vec_shr for AArch64, for the old specification; patch 13
updates that implementation to fit the new endianness-neutral specification,
serving as a guide for other existing backends. Patches/RFCs 15 and 16 are
equivalents for MIPS and PowerPC; I haven't tested these but hope they act
as useful pointers for the port maintainers.
Finally patch 14 cleans up the affected part of tree-vect-loop.c
(vect_create_epilog_for_reduction).
As said during the individual patches review I'd like the vectorizer to
use a VEC_PERM_EXPR instead of VEC_RSHIFT_EXPR (with
only whole-element amounts). This means we can remove
VEC_RSHIFT_EXPR. It also means that if the backend defines
vec_perm_const (which it really should) it can handle the special
permutes that boil down to a possibly more efficient vector shift
there (a good optimization anyway). Until it does that all backends
would at least create correct code (with the endian dependent
vec_shr removed).

Richard.
Richard Biener
2014-09-22 11:26:31 UTC
Permalink
It seems only Alpha completely lacks vec_perm_const but implements
vec_shr.

Richard.
Alan Lawrence
2014-10-06 17:31:07 UTC
Permalink
Ok, so unless there are objections, I plan to commit patches 1, 2, 4, 5, and 6,
which have been previously approved, in that sequence. (Of those, all bar patch
2 are AArch64 only.) I think this is better than maintaining an ever-expanding
patch series.

Then I'll get to work on migrating all backends to the new _scal_ optab (and
removing the vector optab). Certainly I'd like to replace vec_shr/l with
vec_perm_expr too, but I'm conscious that the end of stage 1 is approaching!

--Alan
Richard Biener
2014-10-07 07:45:41 UTC
Permalink
Post by Alan Lawrence
Ok, so unless there are objections, I plan to commit patches 1, 2, 4, 5, and 6,
which have been previously approved, in that sequence. (Of those, all bar patch
2 are AArch64 only.) I think this is better than maintaining an ever-expanding
patch series.
Agreed.
Post by Alan Lawrence
Then I'll get to work on migrating all backends to the new _scal_ optab (and
removing the vector optab). Certainly I'd like to replace vec_shr/l with
vec_perm_expr too, but I'm conscious that the end of stage 1 is approaching!
I suppose we all are. It will last until end of October at least (stage1 of
gcc 4.9 ended Nov 22nd, certainly a bit late).

I do expect we will continue merging already developed / posted stuff through
stage3 (as usual).

That said, it would be really nice to get rid of VEC_RSHIFT_EXPR.

Thanks,
Richard.
Richard Biener
2014-10-07 07:46:53 UTC
Permalink
On Tue, Oct 7, 2014 at 9:45 AM, Richard Biener
Post by Richard Biener
That said, it would be really nice to get rid of VEC_RSHIFT_EXPR.
And you can fix performance regressions you introduce (badly handled
VEC_PERM) until the GCC 5 release happens (and even after that).
Heh. Easy way out ;)

Richard.
Alan Lawrence
2014-10-09 17:10:46 UTC
Permalink
Ok....well, I see a path forward, somewhere there....

however (bah), I can't push that subset of patches - I came back from a week's
holiday and misremembered - the AArch64 changes depend upon the introduction of
the _scal_optabs, not just the tree changes :( .

I'll try to post optab migration patches next week for x86, rs6000 (mostly, I
haven't figured out paired.md yet), ARM, and IA64 (fwiw; has only v2sf
reductions). Having looked at MIPS/Loongson I'm feeling a bit bewildered and not
sure how to proceed, so I think I must ask the MIPS maintainers (CC'd) for
assistance: how can one add a vec_extract, to produce a scalar result, to the
end of each reduc_ optab ?
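One possible shape for such a port change, sketched as a hypothetical machine-description fragment (pattern and generator names are illustrative, not taken from any real port): reuse the old vector-result reduction pattern and tack a vec_extract of the endian-dependent lane onto the end.

```
;; Hypothetical sketch, not from any real port: implement the new
;; scalar-producing optab by reusing an old vector-result reduction
;; pattern and extracting the lane the old specification used.
(define_expand "reduc_plus_scal_v4si"
  [(match_operand:SI 0 "register_operand")
   (match_operand:V4SI 1 "register_operand")]
  ""
{
  rtx tmp = gen_reg_rtx (V4SImode);
  emit_insn (gen_reduc_splus_v4si (tmp, operands[1]));
  /* Under the old endian-dependent spec the result lives in the lane
     holding the least significant bits of the vector register.  */
  emit_insn (gen_vec_extractv4si (operands[0], tmp,
                                  GEN_INT (BYTES_BIG_ENDIAN ? 3 : 0)));
  DONE;
})
```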

--Alan