Discussion:
[PATCH RFC] Pair load store instructions using a generic scheduling fusion pass
Bin Cheng
2014-09-30 09:22:51 UTC
Permalink
Hi,
Last time I posted the patch pairing consecutive load/store instructions on
ARM, the patch got some review comments. The most important one, as
suggested by Jeff and Mike, was to do the load/store pairing using the
existing scheduling facility. In the initial investigation, I cleaned up
Mike's patch and fixed some implementation bugs in it; it can now find as
many load/store pairs as my old patch. Then I decided to take one step
further and introduce a generic instruction fusion infrastructure in GCC,
because in essence load/store pairing is no different from other
instruction fusion: all these optimizations want is to push instructions
together in the instruction flow.
So here comes this patch. It adds a new sched_fusion pass just before
peephole2. The methodology is:
1) The priority in the scheduler is extended into a [fusion_priority, priority]
pair, with fusion_priority as the major key and priority as the minor key.
2) The back end assigns a priority pair to each instruction; instructions
that should be fused together get the same fusion_priority.
3) The Haifa scheduler schedules instructions based on fusion priorities;
all other parts are just like the original sched2 pass. Of course, some
functions can be simplified/skipped in this process.
4) With instructions fused together in the flow, the following peephole2 pass
can easily transform the interesting instructions into other forms, e.g.
ldrd/strd for ARM.

The new infrastructure can handle different kinds of fusion in one pass.
It's also easy to extend to new fusion cases: all it takes is to identify
the instructions that should be fused and assign new fusion priorities to
them. Also, as Mike suggested last time, the patch can be very simple by
reusing the existing scheduler facility.

I collected performance data for both cortex-a15 and cortex-a57 (with a
local peephole ldp/stp patch); the benchmarks are clearly improved on
arm/aarch64. I also collected instrumentation data about how many
load/store pairs are found, for the four versions of load/store pair patches:
0) Mike's original patch.
1) My original prototype patch.
2) Mike's pass, cleaned up (with implementation bugs resolved).
3) This new prototype fusion pass.

The numbers of pairing opportunities satisfy the relations below:
3 * N0 ~ N1 ~ N2 < N3
For example, for one benchmark suite, we have:
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500

As a matter of fact, if we move the sched_fusion and peephole2 passes after
register renaming and enable the register renaming pass, this patch can
find even more load/store pairs (~11000 for the above benchmark suite).
But the renaming pass has its own impact on performance and we need more
benchmark data before making that change.

Of course, this patch is not a perfect solution; it does miss load/store
pairs in some corner cases which have complicated instruction dependencies.
Actually it breaks one load/store pair test on armv6 because of such a
corner case, which is why the pass is disabled by default on non-armv7
processors. I may investigate the failure and try to enable the pass for
all arm targets in the future.

So any comments on this?

2014-09-30 Bin Cheng <***@arm.com>
Mike Stump <***@comcast.net>

* timevar.def (TV_SCHED_FUSION): New time var.
* passes.def (pass_sched_fusion): New pass.
* config/arm/arm.c (TARGET_SCHED_FUSION_PRIORITY): New.
(extract_base_offset_in_addr, fusion_load_store): New.
(arm_sched_fusion_priority): New.
(arm_option_override): Disable scheduling fusion on non-armv7
processors by default.
* sched-int.h (struct _haifa_insn_data): New field.
(INSN_FUSION_PRIORITY, FUSION_MAX_PRIORITY, sched_fusion): New.
* sched-rgn.c (rest_of_handle_sched_fusion): New.
(pass_data_sched_fusion, pass_sched_fusion): New.
(make_pass_sched_fusion): New.
* haifa-sched.c (sched_fusion): New.
(insn_cost): Handle sched_fusion.
(priority): Handle sched_fusion by calling target hook.
(enum rfs_decision): New enum value.
(rfs_str): New element for RFS_FUSION.
(rank_for_schedule): Support sched_fusion.
(schedule_insn, max_issue, prune_ready_list): Handle sched_fusion.
(schedule_block, fix_tick_ready): Handle sched_fusion.
* common.opt (flag_schedule_fusion): New.
* tree-pass.h (make_pass_sched_fusion): New.
* target.def (fusion_priority): New.
* doc/tm.texi.in (TARGET_SCHED_FUSION_PRIORITY): New.
* doc/tm.texi: Regenerated.
* doc/invoke.texi (-fschedule-fusion): New.

gcc/testsuite/ChangeLog
2014-09-30 Bin Cheng <***@arm.com>

* gcc.target/arm/ldrd-strd-pair-1.c: New test.
* gcc.target/arm/vfp-1.c: Improve scanning string.
Mike Stump
2014-09-30 21:06:16 UTC
Permalink
Then I decided to take one step further and introduce a generic
instruction fusion infrastructure in GCC, because in essence load/store
pairing is no different from other instruction fusion: all these optimizations
want is to push instructions together in the instruction flow.
I like the step you took. I had exactly this in mind when I wrote the original.
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500
Nice. Would be nice to see metrics for time to ensure that the code isn’t actually worse (CSiBE and/or spec and/or some other). I didn’t have any large scale benchmark runs with my code and I did worry about extending lifetimes and register pressure.
I cleaned up Mike's patch and fixed some implementation bugs in it
So, I’m wondering what the bugs or missed opportunities were? And, if they were of the type of problem that generated incorrect code or if they were of the type that was merely a missed opportunity.
Bin.Cheng
2014-10-06 09:57:55 UTC
Permalink
Post by Mike Stump
Then I decided to take one step further and introduce a generic
instruction fusion infrastructure in GCC, because in essence load/store
pairing is no different from other instruction fusion: all these optimizations
want is to push instructions together in the instruction flow.
I like the step you took. I had exactly this in mind when I wrote the original.
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500
Nice. Would be nice to see metrics for time to ensure that the code isn't actually worse (CSiBE and/or spec and/or some other). I didn't have any large scale benchmark runs with my code and I did worry about extending lifetimes and register pressure.
Hi Mike,
I did collect spec2k performance after pairing load/store using this
patch on both aarch64 and cortex-a15. The performance is clearly
improved, especially on cortex-a57. Some (though not many)
benchmarks regressed a little. There is no register pressure
problem here because this pass sits between register allocation and
sched2; I guess sched2 should resolve most pipeline hazards introduced
by this pass.
Post by Mike Stump
I cleaned up Mike's patch and fixed some implementation bugs in it
So, I'm wondering what the bugs or missed opportunities were? And, if they were of the type of problem that generated incorrect code or if they were of the type that was merely a missed opportunity.
Just missed opportunity issues.

Thanks,
bin
Richard Biener
2014-10-06 11:32:28 UTC
Permalink
Post by Bin.Cheng
Post by Mike Stump
Then I decided to take one step further and introduce a generic
instruction fusion infrastructure in GCC, because in essence load/store
pairing is no different from other instruction fusion: all these optimizations
want is to push instructions together in the instruction flow.
I like the step you took. I had exactly this in mind when I wrote the original.
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500
Nice. Would be nice to see metrics for time to ensure that the code isn't actually worse (CSiBE and/or spec and/or some other). I didn't have any large scale benchmark runs with my code and I did worry about extending lifetimes and register pressure.
Hi Mike,
I did collect spec2k performance after pairing load/store using this
patch on both aarch64 and cortex-a15. The performance is clearly
improved, especially on cortex-a57. Some (though not many)
benchmarks regressed a little. There is no register pressure
problem here because this pass sits between register allocation and
sched2; I guess sched2 should resolve most pipeline hazards introduced
by this pass.
How many merging opportunities does sched2 undo again? ISTR it
has the tendency of pushing stores down and loads up.

Richard.
Post by Bin.Cheng
Post by Mike Stump
I cleaned up Mike's patch and fixed some implementation bugs in it
So, I'm wondering what the bugs or missed opportunities were? And, if they were of the type of problem that generated incorrect code or if they were of the type that was merely a missed opportunity.
Just missed opportunity issues.
Thanks,
bin
Mike Stump
2014-10-06 17:20:03 UTC
Permalink
Post by Richard Biener
How many merging opportunities does sched2 undo again? ISTR it
has the tendency of pushing stores down and loads up.
So, the pass works by merging 2 or more loads into 1 load (at least on my port). sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work; the non-real work doesn't matter. Can sched2 rip apart a single load?
Bin.Cheng
2014-10-07 01:31:27 UTC
Permalink
Post by Richard Biener
How many merging opportunities does sched2 undo again? ISTR it
has the tendency of pushing stores down and loads up.
So, the pass works by merging 2 or more loads into 1 load (at least on my port). sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work; the non-real work doesn't matter. Can sched2 rip apart a single load?
On ARM and AArch64, the two merged loads/stores are transformed into a
single parallel insn by the following peephole2 pass, so sched2
will not undo the fusion work. I thought sched2 works on the basis of
instructions, and it isn't good practice to have sched2 do split work.

Thanks,
bin
Jeff Law
2014-10-08 05:28:25 UTC
Permalink
Post by Bin.Cheng
Post by Richard Biener
How many merging opportunities does sched2 undo again? ISTR it
has the tendency of pushing stores down and loads up.
So, the pass works by merging 2 or more loads into 1 load (at least on my port). sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work; the non-real work doesn't matter. Can sched2 rip apart a single load?
On ARM and AArch64, the two merged loads/stores are transformed into a
single parallel insn by the following peephole2 pass, so sched2
will not undo the fusion work. I thought sched2 works on the basis of
instructions, and it isn't good practice to have sched2 do split work.
It's certainly advantageous for sched2 to split insns that generate
multiple instructions. Running after register allocation, sched2 is
ideal for splitting because we know the alternative for each insn
and thus can (possibly for the first time) accurately know whether a
particular insn will generate multiple assembly instructions.

If the port has a splitter to rip apart a double-word load into
single-word loads, then we'd obviously only want to do that in cases
where the double-word load actually generates > 1 assembly instruction.

Addressing issues in that space seems out of scope for Bin's work to me,
except perhaps for such issues on aarch64/arm which are Bin's primary
concerns.

jeff
Bin.Cheng
2014-10-08 05:52:18 UTC
Permalink
Post by Bin.Cheng
Post by Mike Stump
Post by Richard Biener
How many merging opportunities does sched2 undo again? ISTR it
has the tendency of pushing stores down and loads up.
So, the pass works by merging 2 or more loads into 1 load (at least on my
port). sched2 would need to rip apart 1 load into 2 loads to be able to
undo the real work; the non-real work doesn't matter. Can sched2 rip
apart a single load?
On ARM and AArch64, the two merged loads/stores are transformed into a
single parallel insn by the following peephole2 pass, so sched2
will not undo the fusion work. I thought sched2 works on the basis of
instructions, and it isn't good practice to have sched2 do split work.
It's certainly advantageous for sched2 to split insns that generate multiple
instructions. Running after register allocation, sched2 is ideal for
splitting because we know the alternative for each insn and thus can
(possibly for the first time) accurately know whether a particular insn will
generate multiple assembly instructions.
If the port has a splitter to rip apart a double-word load into single-word
loads, then we'd obviously only want to do that in cases where the
double-word load actually generates > 1 assembly instruction.
Addressing issues in that space seems out of scope for Bin's work to me,
except perhaps for such issues on aarch64/arm which are Bin's primary
concerns.
Hi Jeff,

Thanks very much for the explanation. Very likely I am wrong here,
but it seems what you mentioned fits pass_split_before_sched2 very
well. Then I guess it would be nice if we could differentiate the cases
in the first place by generating different patterns, rather than
splitting some of the instructions later. Though I have no idea whether
we can do that or not.

For arm/aarch64, I guess it's not an issue; otherwise the peephole2
won't work at all. ARM maintainers should have an answer to this.
jeff
Ramana Radhakrishnan
2014-10-08 10:27:54 UTC
Permalink
Post by Bin.Cheng
If the port has a splitter to rip apart a double-word load into single-word loads, then we'd obviously only want to do that in cases where the double-word load actually generates > 1 assembly instruction.
Or indeed if it is really a performance win. And I think that should
purely be a per-port / micro-architectural decision.
Post by Bin.Cheng
For arm/aarch64, I guess it's not an issue, otherwise the peephole2
won't work at all. ARM maintainers should have answer to this.
Generating more ldrd's and strd's will be beneficial in the ARM and
the AArch64 port - we save code size and start using more memory
bandwidth available per instruction on most higher end cores that I'm
aware of. Even on the smaller microcontrollers I expect it to be a win
because you've saved code size. There may well be pathological cases
given we've shortened some dependencies or increased lifetimes of
others but overall I'd expect it to be more positive than negative.

I also expect this to be more effective in the T32 (Thumb2) ISA and
AArch64, because ldrd/strd and ldp/stp respectively can work with
any registers, unlike the A32 ISA where the registers loaded or stored
must be consecutive registers. I'm hoping for some more review on the
generic bits before looking into the backend implementation in the
expectation that this is the direction folks want to proceed.
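As a concrete sketch of the register constraint mentioned above (register
choices and addressing here are hypothetical examples, not output from the
patch): A32 ldrd requires a consecutive even/odd destination pair, while
T32 ldrd and AArch64 ldp accept arbitrary registers, which gives the
fusion pass more freedom on those ISAs.

```asm
@ A32: destination registers must be a consecutive even/odd pair
ldrd    r4, r5, [r0]        @ OK: r4/r5 are consecutive
@ ldrd  r4, r7, [r0]        @ not encodable in A32

@ T32 (Thumb2): any two distinct registers are allowed
ldrd    r4, r7, [r0]

// AArch64: ldp likewise takes any two registers
ldp     x7, x3, [x0]
```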


regards
Ramana
Post by Bin.Cheng
jeff
Jeff Law
2014-10-08 22:21:43 UTC
Permalink
Post by Ramana Radhakrishnan
If the port has a splitter to rip apart a double-word load into single-word loads, then we'd obviously only want to do that in cases where the double-word load actually generates > 1 assembly instruction.
Or indeed if it is really a performance win. And I think that should
purely be a per-port / micro-architectural decision.
Agreed.
Post by Ramana Radhakrishnan
Generating more ldrd's and strd's will be beneficial in the ARM and
the AArch64 port - we save code size and start using more memory
bandwidth available per instruction on most higher end cores that I'm
aware of. Even on the smaller microcontrollers I expect it to be a win
because you've saved code size. There may well be pathological cases
given we've shortened some dependencies or increased lifetimes of
others but overall I'd expect it to be more positive than negative.
Agreed. I suspect there's multiple architectures where the results
would be similar -- code size improvements, more effective use of memory
bandwidth with possibly some pathological case(s) that we really
shouldn't worry too much about.
Post by Ramana Radhakrishnan
I also expect this to be more effective in the T32 (Thumb2) ISA and
AArch64, because ldrd/strd and ldp/stp respectively can work with
any registers, unlike the A32 ISA where the registers loaded or stored
must be consecutive registers. I'm hoping for some more review on the
generic bits before looking into the backend implementation in the
expectation that this is the direction folks want to proceed.
I've got some questions that I'm formulating to make sure I understand
how the facility is to be used. I may have to simply sit down with the
code installed on a test build and play with it.

However, to be clear, I really like the direction this work has gone.

Jeff
Mike Stump
2014-10-09 11:14:12 UTC
Permalink
It's certainly advantageous for sched2 to split insns that generate multiple instructions.
So, on my port, I have a load multiple that is just one instruction, and it is a single clock cycle (to enqueue it).