[PATCH] Fix PR63259: bswap not recognized when finishing with rotation

Discussion:

Thomas Preud'homme

2014-10-08 01:56:51 UTC

Currently the bswap pass only look for bswap pattern by examining bitwise
OR statement and doing following def-use chains. However a rotation
(left or right) can finish a manual byteswap, as shown in the following example:

unsigned
byteswap_ending_with_rotation (unsigned in)
{
in = ((in & 0xff00ff00) >> 8) | ((in & 0x00ff00ff) << 8);
in = ((in & 0xffff0000) >> 16) | ((in & 0x0000ffff) << 16);
return in;
}

which is compiled into:

byteswap_ending_with_rotation (unsigned int in)
{
unsigned int _2;
unsigned int _3;
unsigned int _4;
unsigned int _5;

<bb 2>:
_2 = in_1(D) & 4278255360;
_3 = _2 >> 8;
_4 = in_1(D) & 16711935;
_5 = _4 << 8;
in_6 = _5 | _3;
in_7 = in_6 r>> 16;
return in_7;

}

This patch adds rotation (left and right) to the list of statement to consider for byte swap.

ChangeLog are as follows:

*** gcc/ChangeLog ***

2014-09-30 Thomas Preud'homme <***@arm.com>

PR tree-optimization/63259
* tree-ssa-math-opts.c (pass_optimize_bswap::execute): Also consider
bswap in LROTATE_EXPR and RROTATE_EXPR statements.

*** gcc/testsuite/ChangeLog ***

2014-09-30 Thomas Preud'homme <***@arm.com>

PR tree-optimization/63259
* optimize-bswapsi-1.c (swap32_e): New bswap pass test.

diff --git a/gcc/testsuite/gcc.dg/optimize-bswapsi-1.c b/gcc/testsuite/gcc.dg/optimize-bswapsi-1.c
index 580e6e0..d4b5740 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswapsi-1.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswapsi-1.c
@@ -64,5 +64,16 @@ swap32_d (SItype in)
| (((in >> 24) & 0xFF) << 0);
}

-/* { dg-final { scan-tree-dump-times "32 bit bswap implementation found at" 4 "bswap" } } */
+/* This variant comes from PR63259. It compiles to a gimple sequence that ends
+ with a rotation instead of a bitwise OR. */
+
+unsigned
+swap32_e (unsigned in)
+{
+ in = ((in & 0xff00ff00) >> 8) | ((in & 0x00ff00ff) << 8);
+ in = ((in & 0xffff0000) >> 16) | ((in & 0x0000ffff) << 16);
+ return in;
+}
+
+/* { dg-final { scan-tree-dump-times "32 bit bswap implementation found at" 5 "bswap" } } */
/* { dg-final { cleanup-tree-dump "bswap" } } */
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index 3c6e935..2023f2e 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -2377,11 +2377,16 @@ pass_optimize_bswap::execute (function *fun)
{
gimple src_stmt, cur_stmt = gsi_stmt (gsi);
tree fndecl = NULL_TREE, bswap_type = NULL_TREE, load_type;
+ enum tree_code code;
struct symbolic_number n;
bool bswap;

- if (!is_gimple_assign (cur_stmt)
- || gimple_assign_rhs_code (cur_stmt) != BIT_IOR_EXPR)
+ if (!is_gimple_assign (cur_stmt))
+ continue;
+
+ code = gimple_assign_rhs_code (cur_stmt);
+ if (code != BIT_IOR_EXPR && code != LROTATE_EXPR
+ && code != RROTATE_EXPR)
continue;

src_stmt = find_bswap_or_nop (cur_stmt, &n, &bswap);

Testing was done by running the testsuite on arm-none-eabi target with QEMU
emulating Cortex-M3: no regression were found. Due to the potential increase
in compilation time, A bootstrap with sequential build (no -j option when calling
make) and with default option was made with and without the patch. The
results shows no increase compilation time:

r215662 with patch:
make 6167.48s user 401.03s system 99% cpu 1:49:52.07 total

r215662 without patch
make 6136.63s user 400.32s system 99% cpu 1:49:27.28 total

Is it ok for trunk?

Best regards,

Thomas Preud'homme

Jakub Jelinek

2014-10-08 06:38:43 UTC

Permalink

Post by Thomas Preud'homme
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -2377,11 +2377,16 @@ pass_optimize_bswap::execute (function *fun)
{
gimple src_stmt, cur_stmt = gsi_stmt (gsi);
tree fndecl = NULL_TREE, bswap_type = NULL_TREE, load_type;
+ enum tree_code code;
struct symbolic_number n;
bool bswap;
- if (!is_gimple_assign (cur_stmt)
- || gimple_assign_rhs_code (cur_stmt) != BIT_IOR_EXPR)
+ if (!is_gimple_assign (cur_stmt))
+ continue;
+
+ code = gimple_assign_rhs_code (cur_stmt);
+ if (code != BIT_IOR_EXPR && code != LROTATE_EXPR
+ && code != RROTATE_EXPR)
continue;
src_stmt = find_bswap_or_nop (cur_stmt, &n, &bswap);

Doesn't it turn 16-bit {L,R}ROTATE_EXPR used alone into __builtin_bswap16?
For those the question is if the canonical GIMPLE should be the rotation or
byteswap, I'd think rotation would be perhaps better. Or depending on if
the backend has bswaphi2 or rotate pattern?

Also, perhaps you could short-circuit this if the rotation isn't by constant
or not a multiple of BITS_PER_UNIT. So
switch (code)
{
case BIT_IOR_EXPR:
break;
case LROTATE_EXPR:
case RROTATE_EXPR:
if (!tree_fits_uhwi_p (gimple_assign_rhs2 (cur_stmt))
|| (tree_to_uhwi (gimple_assign_rhs2 (cur_stmt))
% BITS_PER_UNIT))
continue;
break;
default:
continue;
}
?

Jakub

Thomas Preud'homme

2014-10-08 06:43:04 UTC

Permalink

Sent: Wednesday, October 08, 2014 2:39 PM
Doesn't it turn 16-bit {L,R}ROTATE_EXPR used alone into
__builtin_bswap16?
For those the question is if the canonical GIMPLE should be the rotation or
byteswap, I'd think rotation would be perhaps better. Or depending on if
the backend has bswaphi2 or rotate pattern?

Good point. It seems better to keep the status quo.

Also, perhaps you could short-circuit this if the rotation isn't by constant
or not a multiple of BITS_PER_UNIT. So
switch (code)
{
break;
if (!tree_fits_uhwi_p (gimple_assign_rhs2 (cur_stmt))
|| (tree_to_uhwi (gimple_assign_rhs2 (cur_stmt))
% BITS_PER_UNIT))
continue;
break;
continue;
}
?

Right. Thanks for the comments.

Best regards,

Thomas

Thomas Preud'homme

2014-10-08 06:56:46 UTC

Permalink

Sent: Wednesday, October 08, 2014 2:43 PM

Post by Jakub Jelinek
Also, perhaps you could short-circuit this if the rotation isn't by constant

Note that do_shift_rotate already check for this. Is it enough?

Best regards,

Thomas

Jakub Jelinek

2014-10-08 07:04:35 UTC

Permalink

Post by Thomas Preud'homme

Sent: Wednesday, October 08, 2014 2:43 PM

Post by Jakub Jelinek
Also, perhaps you could short-circuit this if the rotation isn't by constant

Note that do_shift_rotate already check for this. Is it enough?

The question is only about the compile time cost needed through all the
routines find_bswap_or_nop calls before it finds out it can't do anything
with the rotate, the short-circuiting could avoid that. E.g.
find_bswap_or_nop_1 first recurses on the argument etc.

Jakub

Richard Biener

2014-10-08 07:27:24 UTC

Permalink

Post by Jakub Jelinek

Post by Thomas Preud'homme

Sent: Wednesday, October 08, 2014 2:43 PM

Post by Jakub Jelinek
Also, perhaps you could short-circuit this if the rotation isn't by constant

Note that do_shift_rotate already check for this. Is it enough?

I wouldn't worry about that too much. Indeed the question would be what
should be canonical on GIMPLE (expanders should choose the optimal
vairant from both). I think a tree code should be always prefered to a
builtin function call - which means a rotate is more canonical than a
bswap16 call.

Post by Jakub Jelinek
From the performance side the pass could be re-structured to populate

a lattice, thus work from def to use instead of the other way around. Which
means we visit each stmt exactly once, compute its value symbolically
and check it against a rotate.

Richard.

Post by Jakub Jelinek
Jakub

Andrew Pinski

2014-10-08 06:58:12 UTC

Permalink

On Tue, Oct 7, 2014 at 11:43 PM, Thomas Preud'homme

Post by Thomas Preud'homme

Good point. It seems better to keep the status quo.

There is already code which converts __builtin_bswap16 to the rotate
so maybe it does not matter. Though I have not seen code which does
the rotate into a bswaphi2 pattern.

Thanks,
Andrew

Post by Thomas Preud'homme

Right. Thanks for the comments.
Best regards,
Thomas