Bin Cheng
2014-09-30 09:22:51 UTC
Hi,
Last time I posted the patch pairing consecutive load/store instructions on
ARM, it received some review comments. The most important one, as
suggested by Jeff and Mike, was to do the load/store pairing using the
existing scheduling facility. In my initial investigation, I cleaned up
Mike's patch and fixed some implementation bugs in it; it can now find as
many load/store pairs as my old patch. Then I decided to go one step
further and introduce a generic instruction fusion infrastructure in GCC,
because in essence load/store pairing is no different from other
instruction fusion: all these optimizations want is to push instructions
together in the instruction flow.
So here comes this patch. It adds a new sched_fusion pass just before
peephole2. The methodology is as follows:
1) The priority in the scheduler is extended into a [fusion_priority, priority]
pair, with fusion_priority as the major key and priority as the minor key.
2) The back-end assigns a priority pair to each instruction; instructions
that want to be fused together are assigned the same fusion_priority.
3) The haifa scheduler schedules instructions based on fusion priorities;
all other parts are just like the original sched2 pass. Of course, some
functions can be simplified/skipped in this process.
4) With instructions placed together in the flow, the following peephole2
pass can easily transform interesting instructions into other forms, such
as ldrd/strd for ARM.
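
To make the two-level key in steps 1) to 3) concrete, here is a minimal standalone sketch in C. It is not the patch's code: struct toy_insn, toy_rank and toy_order are hypothetical names, and the "larger value is scheduled first" direction and the uid tie-break are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-instruction scheduling data;
   this is NOT GCC's struct _haifa_insn_data.  */
struct toy_insn
{
  int uid;              /* identifies the instruction */
  int fusion_priority;  /* major key: equal values mark fusion candidates */
  int priority;         /* minor key: order inside one fusion group */
};

/* qsort comparator.  Assumption: a larger key is scheduled earlier.
   Compare fusion_priority first, then priority, then uid so that the
   result is deterministic.  */
static int
toy_rank (const void *a, const void *b)
{
  const struct toy_insn *x = a;
  const struct toy_insn *y = b;

  if (x->fusion_priority != y->fusion_priority)
    return (x->fusion_priority < y->fusion_priority)
           - (x->fusion_priority > y->fusion_priority);
  if (x->priority != y->priority)
    return (x->priority < y->priority) - (x->priority > y->priority);
  return (x->uid > y->uid) - (x->uid < y->uid);
}

/* Order N instructions so that members of one fusion group end up
   adjacent, which is what a later peephole-style pass relies on.  */
void
toy_order (struct toy_insn *insns, size_t n)
{
  assert (insns != NULL || n == 0);
  if (n > 1)
    qsort (insns, n, sizeof insns[0], toy_rank);
}
```

After toy_order, instructions sharing a fusion_priority sit next to each other, ordered by their ordinary priority within the group. The real scheduler also has to respect data dependencies, which this sketch ignores.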
The new infrastructure can handle different kinds of fusion in one pass.
It's also easy to extend to new fusion cases: all it takes is to identify
instructions that want to be fused and assign new fusion priorities to
them. Also, as Mike suggested last time, the patch can be kept very simple
by reusing the existing scheduler facility.
I collected performance data for both cortex-a15 and cortex-a57 (with a
local peephole ldp/stp patch); the benchmarks show a clear improvement on
arm/aarch64. I also collected instrumentation data on how many load/store
pairs are found. For the four versions of the load/store pair patches:
0) Mike's original patch.
1) My original prototype patch.
2) Cleaned-up version of Mike's pass (with implementation bugs resolved).
3) This new prototype fusion pass.
The numbers of pairing opportunities found (N0..N3) satisfy the following
relation:
3 * N0 ~ N1 ~ N2 < N3
For example, for one benchmark suite, we have:
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500
As a matter of fact, if we move the sched_fusion and peephole2 passes after
register renaming and enable the register renaming pass, this patch can
find even more load/store pairs (~11000 for the above benchmark suite). But
the rename pass has its own impact on performance, and we need more
benchmark data before making that change.
Of course, this patch is not a perfect solution; it does miss load/store
pairs in some corner cases which have complicated instruction dependencies.
In fact, it breaks one load/store pair test on armv6 because of such a
corner case, which is why the pass is disabled by default on non-armv7
processors. I may investigate the failure and try to enable the pass for
all arm targets in the future.
So any comments on this?
2014-09-30 Bin Cheng <***@arm.com>
Mike Stump <***@comcast.net>
* timevar.def (TV_SCHED_FUSION): New time var.
* passes.def (pass_sched_fusion): New pass.
* config/arm/arm.c (TARGET_SCHED_FUSION_PRIORITY): New.
(extract_base_offset_in_addr, fusion_load_store): New.
(arm_sched_fusion_priority): New.
(arm_option_override): Disable scheduling fusion on non-armv7
processors by default.
* sched-int.h (struct _haifa_insn_data): New field.
(INSN_FUSION_PRIORITY, FUSION_MAX_PRIORITY, sched_fusion): New.
* sched-rgn.c (rest_of_handle_sched_fusion): New.
(pass_data_sched_fusion, pass_sched_fusion): New.
(make_pass_sched_fusion): New.
* haifa-sched.c (sched_fusion): New.
(insn_cost): Handle sched_fusion.
(priority): Handle sched_fusion by calling target hook.
(enum rfs_decision): New enum value.
(rfs_str): New element for RFS_FUSION.
(rank_for_schedule): Support sched_fusion.
(schedule_insn, max_issue, prune_ready_list): Handle sched_fusion.
(schedule_block, fix_tick_ready): Handle sched_fusion.
* common.opt (flag_schedule_fusion): New.
* tree-pass.h (make_pass_sched_fusion): New.
* target.def (fusion_priority): New.
* doc/tm.texi.in (TARGET_SCHED_FUSION_PRIORITY): New.
* doc/tm.texi: Regenerated.
* doc/invoke.texi (-fschedule-fusion): New.
gcc/testsuite/ChangeLog
2014-09-30 Bin Cheng <***@arm.com>
* gcc.target/arm/ldrd-strd-pair-1.c: New test.
* gcc.target/arm/vfp-1.c: Improve scanning string.