Discussion:
[PATCH RFC] Pair load store instructions using a generic scheduling fusion pass
Bin Cheng
2014-09-30 09:22:51 UTC
Permalink
Hi,
Last time I posted the patch pairing consecutive load/store instructions on
ARM, the patch got some review comments. The most important one, as
suggested by Jeff and Mike, was to do the load/store pairing using the
existing scheduling facility. In the initial investigation, I cleaned up
Mike's patch and fixed some implementation bugs in it; it can now find as
many load/store pairs as my old patch. Then I decided to take one step
further and introduce a generic instruction fusion infrastructure in GCC,
because in essence load/store pairing is no different from other
instruction fusion: all these optimizations want is to push instructions
together in the instruction flow.
So here comes this patch. It adds a new sched_fusion pass just before
peephole2. The methodology is:
1) The priority in the scheduler is extended into a [fusion_priority, priority]
pair, with fusion_priority as the major key and priority as the minor key.
2) The back end assigns a priority pair to each instruction; instructions
that should be fused together get the same fusion_priority.
3) The Haifa scheduler schedules instructions based on fusion priorities;
all other parts are just like the original sched2 pass. Of course, some
functions can be simplified/skipped in this process.
4) With instructions fused together in the flow, the following peephole2 pass
can easily transform the interesting instructions into other forms, e.g.
ldrd/strd for ARM.

The new infrastructure can handle different kinds of fusion in one pass.
It's also easy to extend to new fusion cases: all it takes is to identify
the instructions that should be fused and assign new fusion priorities to
them. Also, as Mike suggested last time, the patch can be very simple by
reusing the existing scheduler facility.

I collected performance data for both cortex-a15 and cortex-a57 (with a
local peephole ldp/stp patch); the benchmarks are clearly improved on
arm/aarch64. I also collected instrumentation data about how many
load/store pairs are found, for the four versions of load/store pair patches:
0) Mike's original patch.
1) My original prototype patch.
2) Mike's pass, cleaned up (with implementation bugs resolved).
3) This new prototype fusion pass.

The numbers of pairing opportunities satisfy the relations below:
3 * N0 ~ N1 ~ N2 < N3
For example, for one benchmark suite, we have:
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500

As a matter of fact, if we move the sched_fusion and peephole2 passes after
register renaming and enable the register renaming pass, this patch can
find even more load/store pairs (~11000 for the above benchmark suite).
But the renaming pass has its own impact on performance and we need more
benchmark data before making that change.

Of course, this patch is not a perfect solution; it does miss load/store
pairs in some corner cases which have complicated instruction dependencies.
Actually it breaks one load/store pair test on armv6 because of such a
corner case, which is why the pass is disabled by default on non-armv7
processors. I may investigate the failure and try to enable the pass for
all arm targets in the future.

So any comments on this?

2014-09-30 Bin Cheng <***@arm.com>
Mike Stump <***@comcast.net>

* timevar.def (TV_SCHED_FUSION): New time var.
* passes.def (pass_sched_fusion): New pass.
* config/arm/arm.c (TARGET_SCHED_FUSION_PRIORITY): New.
(extract_base_offset_in_addr, fusion_load_store): New.
(arm_sched_fusion_priority): New.
(arm_option_override): Disable scheduling fusion on non-armv7
processors by default.
* sched-int.h (struct _haifa_insn_data): New field.
(INSN_FUSION_PRIORITY, FUSION_MAX_PRIORITY, sched_fusion): New.
* sched-rgn.c (rest_of_handle_sched_fusion): New.
(pass_data_sched_fusion, pass_sched_fusion): New.
(make_pass_sched_fusion): New.
* haifa-sched.c (sched_fusion): New.
(insn_cost): Handle sched_fusion.
(priority): Handle sched_fusion by calling target hook.
(enum rfs_decision): New enum value.
(rfs_str): New element for RFS_FUSION.
(rank_for_schedule): Support sched_fusion.
(schedule_insn, max_issue, prune_ready_list): Handle sched_fusion.
(schedule_block, fix_tick_ready): Handle sched_fusion.
* common.opt (flag_schedule_fusion): New.
* tree-pass.h (make_pass_sched_fusion): New.
* target.def (fusion_priority): New.
* doc/tm.texi.in (TARGET_SCHED_FUSION_PRIORITY): New.
* doc/tm.texi: Regenerated.
* doc/invoke.texi (-fschedule-fusion): New.

gcc/testsuite/ChangeLog
2014-09-30 Bin Cheng <***@arm.com>

* gcc.target/arm/ldrd-strd-pair-1.c: New test.
* gcc.target/arm/vfp-1.c: Improve scanning string.
Mike Stump
2014-09-30 21:06:16 UTC
Permalink
Then I decided to take one step further and introduce a generic
instruction fusion infrastructure in GCC, because in essence load/store
pairing is no different from other instruction fusion: all these optimizations
want is to push instructions together in the instruction flow.
I like the step you took. I had exactly this in mind when I wrote the original.
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500
Nice. Would be nice to see metrics for time to ensure that the code isn’t actually worse (CSiBE and/or spec and/or some other). I didn’t have any large scale benchmark runs with my code and I did worry about extending lifetimes and register pressure.
I cleaned up Mike's patch and fixed some implementation bugs in it
So, I’m wondering what the bugs or missed opportunities were? And, if they were of the type of problem that generated incorrect code or if they were of the type that was merely a missed opportunity.
Bin.Cheng
2014-10-06 09:57:55 UTC
Permalink
Post by Mike Stump
Then I decided to take one step further and introduce a generic
instruction fusion infrastructure in GCC, because in essence load/store
pairing is no different from other instruction fusion: all these optimizations
want is to push instructions together in the instruction flow.
I like the step you took. I had exactly this in mind when I wrote the original.
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500
Nice. Would be nice to see metrics for time to ensure that the code isn't actually worse (CSiBE and/or spec and/or some other). I didn't have any large scale benchmark runs with my code and I did worry about extending lifetimes and register pressure.
Hi Mike,
I did collect spec2k performance after pairing load/store using this
patch on both aarch64 and cortex-a15. The performance is clearly
improved, especially on cortex-a57. Some (though not many)
benchmarks regressed a little. There is no register pressure
problem here because this pass sits between register allocation and
sched2; I guess sched2 should resolve most pipeline hazards introduced
by this pass.
Post by Mike Stump
I cleaned up Mike's patch and fixed some implementation bugs in it
So, I'm wondering what the bugs or missed opportunities were? And, if they were of the type of problem that generated incorrect code or if they were of the type that was merely a missed opportunity.
Just missed opportunity issues.

Thanks,
bin
Richard Biener
2014-10-06 11:32:28 UTC
Permalink
Post by Bin.Cheng
Post by Mike Stump
Then I decided to take one step further and introduce a generic
instruction fusion infrastructure in GCC, because in essence load/store
pairing is no different from other instruction fusion: all these optimizations
want is to push instructions together in the instruction flow.
I like the step you took. I had exactly this in mind when I wrote the original.
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500
Nice. Would be nice to see metrics for time to ensure that the code isn't actually worse (CSiBE and/or spec and/or some other). I didn't have any large scale benchmark runs with my code and I did worry about extending lifetimes and register pressure.
Hi Mike,
I did collect spec2k performance after pairing load/store using this
patch on both aarch64 and cortex-a15. The performance is clearly
improved, especially on cortex-a57. Some (though not many)
benchmarks regressed a little. There is no register pressure
problem here because this pass sits between register allocation and
sched2; I guess sched2 should resolve most pipeline hazards introduced
by this pass.
How many merging opportunities does sched2 undo again? ISTR it
has the tendency of pushing stores down and loads up.

Richard.
Post by Bin.Cheng
Post by Mike Stump
I cleaned up Mike's patch and fixed some implementation bugs in it
So, I'm wondering what the bugs or missed opportunities were? And, if they were of the type of problem that generated incorrect code or if they were of the type that was merely a missed opportunity.
Just missed opportunity issues.
Thanks,
bin
Mike Stump
2014-10-06 17:20:03 UTC
Permalink
Post by Richard Biener
How many merging opportunities does sched2 undo again? ISTR it
has the tendency of pushing stores down and loads up.
So, the pass works by merging 2 or more loads into 1 load (at least on my port). sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work; the non-real work doesn't matter. Can sched2 rip apart a single load?
Bin.Cheng
2014-10-07 01:31:27 UTC
Permalink
Post by Richard Biener
How many merging opportunities does sched2 undo again? ISTR it
has the tendency of pushing stores down and loads up.
So, the pass works by merging 2 or more loads into 1 load (at least on my port). sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work; the non-real work doesn't matter. Can sched2 rip apart a single load?
On ARM and AArch64, the two merged loads/stores are transformed into a
single parallel insn by the following peephole2 pass, so sched2
will not undo the fusion work. I thought sched2 works on the basis of
instructions, and it isn't good practice to have sched2 do split work.

Thanks,
bin
Jeff Law
2014-10-08 05:28:25 UTC
Permalink
Post by Bin.Cheng
Post by Richard Biener
How many merging opportunities does sched2 undo again? ISTR it
has the tendency of pushing stores down and loads up.
So, the pass works by merging 2 or more loads into 1 load (at least on my port). sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work; the non-real work doesn't matter. Can sched2 rip apart a single load?
On ARM and AArch64, the two merged loads/stores are transformed into a
single parallel insn by the following peephole2 pass, so sched2
will not undo the fusion work. I thought sched2 works on the basis of
instructions, and it isn't good practice to have sched2 do split work.
It's certainly advantageous for sched2 to split insns that generate
multiple instructions. Running after register allocation, sched2 is
ideal for splitting because we know the alternative for each insn
and thus can (possibly for the first time) accurately know whether a
particular insn will generate multiple assembly instructions.

If the port has a splitter to rip apart a double-word load into
single-word loads, then we'd obviously only want to do that in cases
where the double-word load actually generates > 1 assembly instruction.

Addressing issues in that space seems out of scope for Bin's work to me,
except perhaps for such issues on aarch64/arm which are Bin's primary
concerns.

jeff
Bin.Cheng
2014-10-08 05:52:18 UTC
Permalink
Post by Bin.Cheng
Post by Mike Stump
Post by Richard Biener
How many merging opportunities does sched2 undo again? ISTR it
has the tendency of pushing stores down and loads up.
So, the pass works by merging 2 or more loads into 1 load (at least on my
port). sched2 would need to rip apart 1 load into 2 loads to be able to
undo the real work; the non-real work doesn't matter. Can sched2 rip
apart a single load?
On ARM and AArch64, the two merged loads/stores are transformed into a
single parallel insn by the following peephole2 pass, so sched2
will not undo the fusion work. I thought sched2 works on the basis of
instructions, and it isn't good practice to have sched2 do split work.
It's certainly advantageous for sched2 to split insns that generate multiple
instructions. Running after register allocation, sched2 is ideal for
splitting because we know the alternative for each insn and thus can
(possibly for the first time) accurately know whether a particular insn will
generate multiple assembly instructions.
If the port has a splitter to rip apart a double-word load into single-word
loads, then we'd obviously only want to do that in cases where the
double-word load actually generates > 1 assembly instruction.
Addressing issues in that space seems out of scope for Bin's work to me,
except perhaps for such issues on aarch64/arm which are Bin's primary
concerns.
Hi Jeff,

Thanks very much for the explanation. Very likely I am wrong here,
but it seems what you mentioned fits pass_split_before_sched2 very
well. Then I guess it would be nice if we could differentiate the cases
in the first place by generating different patterns, rather than
splitting some of the instructions later. Though I have no idea whether
we can do that or not.

For arm/aarch64, I guess it's not an issue; otherwise the peephole2
won't work at all. ARM maintainers should have an answer to this.
jeff
Ramana Radhakrishnan
2014-10-08 10:27:54 UTC
Permalink
Post by Bin.Cheng
If the port has a splitter to rip apart a double-word load into single-word loads, then we'd obviously only want to do that in cases where the double-word load actually generates > 1 assembly instruction.
Or indeed if it is really a performance win. And I think that should
purely be a per-port / micro-architectural decision.
Post by Bin.Cheng
For arm/aarch64, I guess it's not an issue, otherwise the peephole2
won't work at all. ARM maintainers should have answer to this.
Generating more ldrd's and strd's will be beneficial in the ARM and
the AArch64 port - we save code size and start using more memory
bandwidth available per instruction on most higher end cores that I'm
aware of. Even on the smaller microcontrollers I expect it to be a win
because you've saved code size. There may well be pathological cases
given we've shortened some dependencies or increased lifetimes of
others but overall I'd expect it to be more positive than negative.

I also expect this to be more effective in the T32 (Thumb2) ISA and
AArch64, because ldrd/strd and ldp/stp respectively can work with
any registers, unlike the A32 ISA where the registers loaded or stored
must be consecutive registers. I'm hoping for some more review on the
generic bits before looking into the backend implementation in the
expectation that this is the direction folks want to proceed.
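As a concrete sketch of the register constraint mentioned above (register
choices and addressing here are hypothetical examples, not output from the
patch): A32 ldrd requires a consecutive even/odd destination pair, while
T32 ldrd and AArch64 ldp accept arbitrary registers, which gives the
fusion pass more freedom on those ISAs.

```asm
@ A32: destination registers must be a consecutive even/odd pair
ldrd    r4, r5, [r0]        @ OK: r4/r5 are consecutive
@ ldrd  r4, r7, [r0]        @ not encodable in A32

@ T32 (Thumb2): any two distinct registers are allowed
ldrd    r4, r7, [r0]

// AArch64: ldp likewise takes any two registers
ldp     x7, x3, [x0]
```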


regards
Ramana
Post by Bin.Cheng
jeff
Jeff Law
2014-10-08 22:21:43 UTC
Permalink
Post by Ramana Radhakrishnan
If the port has a splitter to rip apart a double-word load into single-word loads, then we'd obviously only want to do that in cases where the double-word load actually generates > 1 assembly instruction.
Or indeed if it is really a performance win. And I think that should
purely be a per-port / micro-architectural decision.
Agreed.
Post by Ramana Radhakrishnan
Generating more ldrd's and strd's will be beneficial in the ARM and
the AArch64 port - we save code size and start using more memory
bandwidth available per instruction on most higher end cores that I'm
aware of. Even on the smaller microcontrollers I expect it to be a win
because you've saved code size. There may well be pathological cases
given we've shortened some dependencies or increased lifetimes of
others but overall I'd expect it to be more positive than negative.
Agreed. I suspect there's multiple architectures where the results
would be similar -- code size improvements, more effective use of memory
bandwidth with possibly some pathological case(s) that we really
shouldn't worry too much about.
Post by Ramana Radhakrishnan
I also expect this to be more effective in the T32 (Thumb2) ISA and
AArch64, because ldrd/strd and ldp/stp respectively can work with
any registers, unlike the A32 ISA where the registers loaded or stored
must be consecutive registers. I'm hoping for some more review on the
generic bits before looking into the backend implementation in the
expectation that this is the direction folks want to proceed.
I've got some questions that I'm formulating to make sure I understand
how the facility is to be used. I may have to simply sit down with the
code installed on a test build and play with it.

However, to be clear, I really like the direction this work has gone.

Jeff
Mike Stump
2014-10-09 11:14:12 UTC
Permalink
It's certainly advantageous for sched2 to split insns that generate multiple instructions.
So, on my port, I have a load multiple that is just one instruction, and it is a single clock cycle (to enqueue it).