Discussion:
[i386] Replace builtins with vector extensions
Marc Glisse
2014-04-11 20:09:44 UTC
Hello,

the previous discussion on the topic was before we added all those #pragma
target in *mmintrin.h:

http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html

I believe that removes a large part of the arguments against it. Note that
I only did a few of the more obvious intrinsics; I am waiting to see
whether this patch is accepted before doing more.

Bootstrap+testsuite on x86_64-linux-gnu.

2014-04-11 Marc Glisse <***@inria.fr>

* config/i386/xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps,
_mm_div_ps, _mm_store_ss, _mm_cvtss_f32): Use vector extensions
instead of builtins.
* config/i386/emmintrin.h (_mm_store_sd, _mm_cvtsd_f64, _mm_storeh_pd,
_mm_cvtsi128_si64, _mm_cvtsi128_si64x, _mm_add_pd, _mm_sub_pd,
_mm_mul_pd, _mm_div_pd, _mm_storel_epi64, _mm_movepi64_pi64,
_mm_loadh_pd, _mm_loadl_pd): Likewise.
(_mm_sqrt_sd): Fix comment.
--
Marc Glisse
Marc Glisse
2014-04-28 11:34:45 UTC
Ping
http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00590.html

(note that ARM seems to be doing the same thing for their NEON
intrinsics; see Ramana's patch series posted today)
Marc Glisse
2014-05-17 13:34:38 UTC
Ping
Post by Marc Glisse
Ping
http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00590.html
(note that ARM seems to be doing the same thing for their neon intrinsics,
see Ramana's patch series posted today)
Marc Glisse
2014-06-28 10:42:45 UTC
Ping,

does nobody have an opinion on this? Or is there some explanation why I am
mistaken to believe that #pragma target makes it safer now?

It would enable a number of optimizations, like constant propagation, FMA
contraction, etc. It would also allow us to remove several builtins.
Ulrich Drepper
2014-06-28 13:37:08 UTC
Post by Marc Glisse
Ping,
nobody has an opinion on this? Or some explanation why I am mistaken to
believe that #pragma target makes it safer now?
It would enable a number of optimizations, like constant propagation, FMA
contraction, etc. It would also allow us to remove several builtins.
I see no problem with using the array-type access to the registers.

As for replacing the builtins with arithmetic operators: I appreciate
the possibility for optimization. But is there any chance the calls
could not end up being implemented with a vector instruction? I think
that would be bad. The intrinsics should be a way to guarantee that
the programmer can create vector instructions. Otherwise we might
just not support them.
Marc Glisse
2014-06-28 22:53:19 UTC
Post by Ulrich Drepper
I see no problem with using the array-type access to the registers.
As for replacing the builtins with arithmetic operators: I appreciate
the possibility for optimization. But is there any chance the calls
could not end up being implemented with a vector instruction? I think
that would be bad. The intrinsics should be a way to guarantee that
the programmer can create vector instructions. Otherwise we might
just not support them.
There is always a risk, but then even with builtins I think there was a
small risk that an RTL optimization would mess things up. It is indeed
higher if we expose the operation to the optimizers earlier, but it would
be a bug if an "optimization" replaced a vector operation by something
worse. Also, I am only proposing to handle the most trivial operations
this way, not more complicated ones (like v[0]+=s) where we would be
likely to fail generating the right instruction. And the pragma should
ensure that the function will always be compiled in a mode where the
vector instruction is available.

ARM did the same and I don't think I have seen a bug reporting a
regression about it (I haven't really looked though).

Thanks,
--
Marc Glisse
Ulrich Drepper
2014-06-29 09:40:56 UTC
Post by Marc Glisse
There is always a risk, but then even with builtins I think there was a
small risk that an RTL optimization would mess things up. It is indeed
higher if we expose the operation to the optimizers earlier, but it would be
a bug if an "optimization" replaced a vector operation by something worse.
Also, I am only proposing to handle the most trivial operations this way,
not more complicated ones (like v[0]+=s) where we would be likely to fail
generating the right instruction. And the pragma should ensure that the
function will always be compiled in a mode where the vector instruction is
available.
ARM did the same and I don't think I have seen a bug reporting a regression
about it (I haven't really looked though).
I think the Arm definitions come from a different angle. It's new,
there is no assumed semantics. For the x86 intrinsics, Intel defines
that _mm_xxx() generates one of a given set of opcodes if there is a match.
If I want to generate a specific code sequence I use the intrinsics.
Otherwise I could already today use the vector type semantics myself.

Don't get me wrong, I like the idea of having the optimization of the
intrinsics happen. But perhaps not unconditionally, or at least not
without a way to prevent it.

I know this will look ugly, but how about a macro
__GCC_X86_HONOR_INTRINSICS to enable the current code and have by
default your proposed use of the vector arithmetic in place? This
wouldn't allow removing support for the built-ins but it would also
open the door to some more risky optimizations to be enabled by
default.
Marc Glisse
2014-06-29 11:06:33 UTC
Post by Ulrich Drepper
I think the Arm definitions come from a different angle. It's new,
there is no assumed semantics.
Is it that new? I thought it was implemented based on a rather precise
specification by ARM. Again, I don't really know arm.
Post by Ulrich Drepper
For the x86 intrinsics Intel defines
that _mm_xxx() generates one of a given set of opcodes if there is a match.
If I want to generate a specific code sequence I use the intrinsics.
We already sometimes generate a different instruction than the name of the
intrinsic suggests, or combine consecutive intrinsics into something
else. I use inline asm when I want a specific code sequence.
Post by Ulrich Drepper
Otherwise I could already today use the vector type semantics myself.
Well, the main reasons I use the intrinsics are:
1) the code compiles with Visual Studio
2) to use esoteric instructions (anything without a trivial mapping in C)
Post by Ulrich Drepper
Don't get me wrong, I like the idea of having the optimization of the
intrinsics happen. But perhaps not unconditionally, or at least not
without a way to prevent it.
I know this will look ugly, but how about a macro
__GCC_X86_HONOR_INTRINSICS to enable the current code and have by
default your proposed use of the vector arithmetic in place? This
wouldn't allow removing support for the built-ins but it would also
open the door to some more risky optimizations to be enabled by
default.
That's a pretty big drawback. Instead of simplifying the implementation,
it makes it more complicated. We also have to document the macro, update
the testsuite so it tests the intrinsics in both modes, etc.

I understand the concern, and I would probably implement
__GCC_X86_HONOR_INTRINSICS (though the testsuite part scares me as I have
so little understanding of how it works, so I may need help), but I'd like
to make sure first that the simpler approach is not acceptable, possibly
with strong constraints on which operations are ok (_mm_load[hl]_pd could
be removed from the patch for instance).

As another comparison, clang's version of *intrin.h uses the vector
extensions much more than I am proposing.
--
Marc Glisse
Kirill Yukhin
2014-07-03 10:17:32 UTC
Hello Marc,
Post by Marc Glisse
It would enable a number of optimizations, like constant
propagation, FMA contraction, etc. It would also allow us to remove
several builtins.
This should be the main motivation for replacing built-ins.
But IMHO this approach should be used for `obvious' cases only.
I mean: + - / * and friends.
I think this shouldn't apply to shuffles or broadcasts.
But we have to define the border between the `obvious' intrinsics and the rest.

On the other hand, an intrinsic updated in such a way
may actually generate a different instruction than intended (e.g. the FMA case).

For ICC it is generally OK to generate different instructions; only the
semantics must be obeyed.

--
Thanks, K
Marc Glisse
2014-07-04 19:11:55 UTC
Post by Kirill Yukhin
Hello Marc,
Post by Marc Glisse
It would enable a number of optimizations, like constant
propagation, FMA contraction, etc. It would also allow us to remove
several builtins.
This should be the main motivation for replacing built-ins.
But IMHO this approach should be used for `obvious' cases only.
I mean: + - / * and friends.
I think this shouldn't apply to shuffles or broadcasts.
But we have to define the border between the `obvious' intrinsics and the rest.
We don't have a syntax in the front-end for broadcasts anyway, but are you
sure about shuffles? __builtin_shuffle directly translates to
VEC_PERM_EXPR, on which we are careful to avoid optimizations like
combining 2 shuffles unless the result is the identity. And expanding
shuffles that can be done in a single instruction works well.

But I am happy not doing them yet. To be very specific, could you list
which intrinsics you would like to remove from the posted patch?
Post by Kirill Yukhin
On the other hand, an intrinsic updated in such a way
may actually generate a different instruction than intended (e.g. the FMA case).
It is the same with scalars, we have -ffp-contract for that.
Post by Kirill Yukhin
For ICC it is generally OK to generate different instructions; only the
semantics must be obeyed.
--
Marc Glisse
Kirill Yukhin
2014-07-08 11:14:04 UTC
Hello Marc.
Post by Marc Glisse
like combining 2 shuffles unless the result is the identity. And
expanding shuffles that can be done in a single instruction works
well.
But I am happy not doing them yet. To be very specific, could you
list which intrinsics you would like to remove from the posted
patch?
I am not an x86 maintainer; however, since such replacements produce
correct semantics and probably enable optimizations, I support your patch.

Perhaps you could try your approach on AVX2 and AVX-512, whose intrinsics
are well covered by tests?
Post by Kirill Yukhin
On the other hand, an intrinsic updated in such a way
may actually generate a different instruction than intended (e.g. the FMA case).
It is the same with scalars, we have -ffp-contract for that.
Agreed.

--
Thanks, K
Jakub Jelinek
2014-07-08 11:17:07 UTC
Post by Kirill Yukhin
Post by Marc Glisse
Post by Kirill Yukhin
On the other hand, an intrinsic updated in such a way
may actually generate a different instruction than intended (e.g. the FMA case).
It is the same with scalars, we have -ffp-contract for that.
Agreed.
I don't think we actually always guarantee using the particular instructions
for the intrinsics even when they are implemented using builtins, at least
if they don't use UNSPECs, e.g. if combiner or peephole2 manage to combine
something into some other insn, we'll happily do that.

Jakub
Mike Stump
2014-07-08 16:02:18 UTC
Post by Jakub Jelinek
I don't think we actually always guarantee using the particular instructions
for the intrinsics even when they are implemented using builtins, at least
if they don't use UNSPECs, e.g. if combiner or peephole2 manage to combine
something into some other insn, we'll happily do that.
In a testcase, one is free to hide the inputs and the output from the optimizer using standard tricks and take one step closer to having a 1-1 mapping. Of course, whether or not the port even offers a 1-1 mapping for any particular builtin is completely dependent upon the port.
Marc Glisse
2014-07-26 17:34:25 UTC
Post by Kirill Yukhin
Hello Marc.
I am not an x86 maintainer; however, since such replacements produce
correct semantics and probably enable optimizations, I support your patch.
Perhaps you could try your approach on AVX2 and AVX-512, whose intrinsics
are well covered by tests?
I did some AVX and AVX512F intrinsics, and it still passes the testsuite
(on my old pre-AVX x86_64-linux-gnu).


2014-07-26 Marc Glisse <***@inria.fr>

* config/i386/xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps,
_mm_div_ps, _mm_store_ss, _mm_cvtss_f32): Use vector extensions
instead of builtins.
* config/i386/avxintrin.h (_mm256_add_pd, _mm256_add_ps,
_mm256_div_pd, _mm256_div_ps, _mm256_mul_pd, _mm256_mul_ps,
_mm256_sub_pd, _mm256_sub_ps): Likewise.
* config/i386/avx512fintrin.h (_mm512_add_pd, _mm512_add_ps,
_mm512_sub_pd, _mm512_sub_ps, _mm512_mul_pd, _mm512_mul_ps,
_mm512_div_pd, _mm512_div_ps): Likewise.
* config/i386/emmintrin.h (_mm_store_sd, _mm_cvtsd_f64, _mm_storeh_pd,
_mm_cvtsi128_si64, _mm_cvtsi128_si64x, _mm_add_pd, _mm_sub_pd,
_mm_mul_pd, _mm_div_pd, _mm_storel_epi64, _mm_movepi64_pi64,
_mm_loadh_pd, _mm_loadl_pd): Likewise.
(_mm_sqrt_sd): Fix comment.
--
Marc Glisse
Kirill Yukhin
2014-07-29 10:50:53 UTC
Hello Marc,
Post by Marc Glisse
I did some AVX and AVX512F intrinsics, and it still passes the
testsuite (on my old pre-AVX x86_64-linux-gnu).
I've tested your patch using a functional simulator of
AVX*, and I see no regressions either.

--
Thanks, K
Marc Glisse
2014-10-09 10:33:50 UTC
Ping https://gcc.gnu.org/ml/gcc-patches/2014-07/msg01812.html

(another part of the discussion is around
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02288.html )

Most people who commented seem cautiously in favor. The least favorable
was Ulrich, who suggested going with it but keeping the old behavior
accessible if the user defines some macro (which IMHO would lose a large
part of the simplification benefits of the patch)
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02328.html

If this is accepted, I will gladly prepare patches removing the unused
builtins and extending this to a few more operations (integer vectors in
particular). If this is not the direction we want to go, I'd like to hear
it clearly so I can move on...

My main doubt with the current patch is whether it is better to write
simply (both variables have type __m128d):
__A + __B
or, as we will have to do for integers:
(__m128d)((__v2df)__A + (__v2df)__B)
--
Marc Glisse
Uros Bizjak
2014-10-09 11:36:42 UTC
Post by Marc Glisse
Ping https://gcc.gnu.org/ml/gcc-patches/2014-07/msg01812.html
(another part of the discussion is around
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02288.html )
Most people who commented seem cautiously in favor. The least favorable was
Ulrich, who suggested going with it but keeping the old behavior accessible if
the user defines some macro (which IMHO would lose a large part of the
simplification benefits of the patch)
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02328.html
If this is accepted, I will gladly prepare patches removing the unused
builtins and extending this to a few more operations (integer vectors in
particular). If this is not the direction we want to go, I'd like to hear it
clearly so I can move on...
Well, I'm undecided.

The current approach is proven to work OK: there are no bugs reported
in this area and the performance is apparently OK. There should be
clear benefits in order to change something that "ain't broken", and
at least some proof that we won't regress in this area with the new
approach.

On the other hand, if the new approach opens new optimization
opportunities (without regression!), I'm in favor of it, including the
fact that new code won't produce equivalent assembly - as long as
functionality of the optimized asm stays the same (obviously, I'd
say).

Please also note that this is quite a big project. There are plenty of
intrinsics and I for one don't want another partial transition ...

TL/DR: If there are benefits, no regressions and you think you'll
finish the transition, let's go for it.

Uros.
Marc Glisse
2014-10-09 12:28:30 UTC
Post by Uros Bizjak
Well, I'm undecided.
First, thanks for answering, it helps me a lot to know what others think.
Post by Uros Bizjak
The current approach is proven to work OK: there are no bugs reported
in this area and the performance is apparently OK. There should be
clear benefits in order to change something that "ain't broken", and
at least some proof that we won't regress in this area with the new
approach.
There are quite a few enhancement PRs asking for more performance, but
indeed no (or very few) complaints about correctness or about gcc turning
their code into something worse than what they wrote, which I completely
agree weighs more.
Post by Uros Bizjak
On the other hand, if the new approach opens new optimization
opportunities (without regression!), I'm in favor of it, including the
fact that new code won't produce equivalent assembly - as long as
functionality of the optimized asm stays the same (obviously, I'd
say).
Please also note that this is quite a big project. There are plenty of
intrinsics and I for one don't want another partial transition ...
That might be an issue: this transition is partial by nature. Many
intrinsics cannot (easily) be expressed in GIMPLE, and among those that
can be represented, we only want to change those for which we are
confident that we will not regress the quality of the code. From the
reactions, I would assume that we want to be quite conservative at the
beginning, and maybe we can reconsider some other intrinsics later.

The best I can offer is consistency: if addition of v2df is changed,
addition of v4df is changed as well (and say any +-*/ of float/double
vectors of any supported size). Another block would be +-*/% for integer
vectors. And construction / access (most construction is already
builtin-free). And remove the unused builtins in the same patch that makes
them unused. If you don't like those blocks, I can write one mega-patch
that does all these, if we roughly agree on the list beforehand, so it
goes in all at once.

Would that be good enough?
Post by Uros Bizjak
TL/DR: If there are benefits, no regressions and you think you'll
finish the transition, let's go for it.
--
Marc Glisse
Uros Bizjak
2014-10-09 12:57:45 UTC
Post by Marc Glisse
The best I can offer is consistency: if addition of v2df is changed,
addition of v4df is changed as well (and say any +-*/ of float/double
vectors of any supported size). Another block would be +-*/% for integer
vectors. And construction / access (most construction is already
builtin-free). And remove the unused builtins in the same patch that makes
them unused. If you don't like those blocks, I can write one mega-patch that
does all these, if we roughly agree on the list beforehand, so it goes in
all at once.
Would that be good enough?
OK, let's go the proposed way; in more detail:

- we begin with +-*/ of float/double vectors. IMO, this would result
in a relatively small and easily reviewable patch to iron out the
details of the approach. Alternatively, we can begin with floats only.
- commit the patch and wait for the sky to fall down.
- we play a bit with the compiler to check generated code and corner
cases (some kind of Q/A) and wait if someone finds a problem (say, a
couple of weeks).
- if there are no problems, continue with integer builtins following
the established approach, otherwise we revert everything and go back
to the drawing board.
- repeat the procedure for other builtins.

I propose to wait a couple of days for possible comments before we get
the ball rolling.

Uros.
Kirill Yukhin
2014-10-09 15:13:55 UTC
Hello folks,
Post by Uros Bizjak
- we begin with +-*/ of float/double vectors. IMO, this would result
in a relatively small and easily reviewable patch to iron out the
details of the approach. Alternatively, we can begin with floats only.
- commit the patch and wait for the sky to fall down.
- we play a bit with the compiler to check generated code and corner
cases (some kind of Q/A) and wait if someone finds a problem (say, a
couple of weeks).
- if there are no problems, continue with integer builtins following
the established approach, otherwise we revert everything and go back
to the drawing board.
- repeat the procedure for other builtins.
I propose to wait a couple of days for possible comments before we get
the ball rolling.
Let me repeat: I think this is a good idea.
I just wanted to kindly ask you to wait for about 1-2 weeks before
checking in these things.
I hope that in that time AVX-512VL/BW/DQ will hit trunk completely
and *lots* more intrinsics will be added (I think the intrinsics are the
subject of the ~[85/n] patch).

--
Thanks, K
H.J. Lu
2014-10-09 15:30:49 UTC
Post by Uros Bizjak
- we begin with +-*/ of float/double vectors. IMO, this would result
in a relatively small and easily reviewable patch to iron out the
details of the approach. Alternatively, we can begin with floats only.
- commit the patch and wait for the sky to fall down.
- we play a bit with the compiler to check generated code and corner
cases (some kind of Q/A) and wait if someone finds a problem (say, a
couple of weeks).
- if there are no problems, continue with integer builtins following
the established approach, otherwise we revert everything and go back
to the drawing board.
- repeat the procedure for other builtins.
I propose to wait a couple of days for possible comments before we get
the ball rolling.
We should also include some testcases to show code improvement
for each change.

Thanks.
--
H.J.
Olivier Hainque
2014-10-09 16:59:23 UTC
Hello Marc,
Post by Marc Glisse
If this is accepted, I will gladly prepare patches removing the unused builtins and extending this to a few more operations (integer vectors in particular). If this is not the direction we want to go, I'd like to hear it clearly so I can move on...
As we discussed offlist, removing all the builtins would be problematic for
Ada, as they are the only medium allowing flexible access to vector
instructions (aside from autovectorization) for users.

Today, the model is very simple: people who want to build on top of vector
operations just bind to the builtins they need and expose higher level
interfaces if they like, provided proper type definitions (see g-sse.ads for
example).

We could provide an Ada version of the standard APIs, for example, as we do
for Altivec on powerpc, and we have offered this capability in response to
customer requests.

Without the builtins, we'd need to define syntax + semantics for vector
operations in the language. While this is an interesting perspective, we
don't have that today and this would be a fair amount of non-trivial work
I'm afraid, not something we can take on just like that.

Note that this doesn't mean that we need all the builtins to remain there.
Just at least one of those providing access to a given machine insn at some
point. We can implement various sorts of always_inline wrappers to perform
type conversions as needed, and builtins are understood as very low level
devices, so changes in the interface aren't an issue. The real issue would be
if access to a given insn became impossible as a result of the removals.

Thanks!

With Kind Regards,

Olivier
Marc Glisse
2014-10-09 17:46:03 UTC
Permalink
Post by Olivier Hainque
If this is accepted, I will gladly prepare patches removing the unused builtins and extending this to a few more operations (integer vectors in particular). If this is not the direction we want to go, I'd like to hear it clearly so I can move on...
As we discussed offlist, removing all the builtins would be problematic for
Ada as they are the only medium allowing flexible access to vector instructions
(aside autovectorization) for users.
Today, the model is very simple: people who want to build on top of vector
operations just bind to the builtins they need and expose higher level
interfaces if they like, provided proper type definitions (see g-sse.ads for
example).
It is sad that this prevents us from removing the builtins, but I agree
that we can't just drop Ada+SSE users like that. Well, it's less work for me
if I don't have to remove the builtins, and my main motivation is
optimization, even if I tried to sell the cleanup to convince people.

Uros, is it still ok if I change the intrinsics without removing the
builtins? (with testcases for HJ and not before Kirill says it is ok)
Post by Olivier Hainque
Without the builtins, we'd need to define syntax + semantics for vector
operations in the language. While this is an interesting perspective, we
don't have that today and this would be a fair amount of non-trivial work
I'm afraid, not something we can take on just like that.
I think it is an interesting possibility to keep in mind (maybe in a few
years?). Basic support in the C front-end is surprisingly simple (C++
templates are a different story), and doesn't need to be duplicated for
sse/altivec/neon... only the "weird" operations really need builtins.

Thanks for posting this,
--
Marc Glisse
Uros Bizjak
2014-10-09 17:56:02 UTC
Permalink
Post by Olivier Hainque
Post by Marc Glisse
If this is accepted, I will gladly prepare patches removing the unused
builtins and extending this to a few more operations (integer vectors in
particular). If this is not the direction we want to go, I'd like to hear it
clearly so I can move on...
As we discussed offlist, removing all the builtins would be problematic for
Ada as they are the only medium allowing flexible access to vector instructions
(aside autovectorization) for users.
Today, the model is very simple: people who want to build on top of vector
operations just bind to the builtins they need and expose higher level
interfaces if they like, provided proper type definitions (see g-sse.ads for
example).
It is sad that this prevents us from removing the builtins, but I agree that
we can't just drop Ada+SSE users like that. Well, it's less work for me if I
don't have to remove the builtins, and my main motivation is optimization,
even if I tried to sell the cleanup to convince people.
Uros, is it still ok if I change the intrinsics without removing the
builtins? (with testcases for HJ and not before Kirill says it is ok)
Given that this will be substantial work, and considering the request
from Kirill, what do you think about a separate development branch until
the AVXn stuff is finished? This would give a couple of weeks and a
playground to finalize the approach for the conversion. Maybe Ada can even
be tested there to make sure the compatibility pieces don't regress.

Uros.
Marc Glisse
2014-10-09 18:04:35 UTC
Permalink
Post by Uros Bizjak
Given that this will be substantial work, and considering the request
from Kirill, what do you think about a separate development branch until
the AVXn stuff is finished? This would give a couple of weeks and a
playground to finalize the approach for the conversion. Maybe Ada can even
be tested there to make sure the compatibility pieces don't regress.
No problem. We can also wait until next stage1 if you believe the release
of gcc-5 is too close.
--
Marc Glisse