Discussion:
[lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])
Steven Bosscher
2012-10-03 15:35:33 UTC
Permalink
On Tue, Oct 2, 2012 at 3:42 PM, Richard Sandiford
+/* Compress pseudo live ranges by removing program points where
+ nothing happens. Complexity of many algorithms in LRA is linear
+ function of program points number. To speed up the code we try to
+ minimize the number of the program points here. */
+static void
+remove_some_program_points_and_update_live_ranges (void)
Genuine question, but could we do this on the fly instead,
by not incrementing curr_point if the current point had no value?
I suppose the main complication would be checking cases where
all births are recorded by extending the previous just-closed live
range rather than starting a new one, in which case it's the previous
point that needs to be reused. Hmm...
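For concreteness, the on-the-fly variant would amount to something like the
sketch below. This is illustrative only, not lra-lives.c code:
record_point_effects and next_program_point are invented names standing in
for the existing birth/death bookkeeping, and the complication mentioned
above (reusing the previous point when a birth merely extends a just-closed
range) is not handled.

  /* Sketch: advance the program point only when the just-processed insn
     actually recorded a birth or a death at it, so empty program points
     are never created and the later compression pass has less to do.  */
  for (insn = BB_END (bb); ; insn = PREV_INSN (insn))
    {
      if (NONDEBUG_INSN_P (insn)
          && record_point_effects (insn, curr_point)) /* any birth/death?  */
        next_program_point (curr_point);              /* curr_point++ etc.  */
      if (insn == BB_HEAD (bb))
        break;
    }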
Compressing live ranges: from 1742579 to 554532 - 31%
Compressing live ranges: from 1742569 to 73069 - 4%
look extreme, but they're actually the norm. For the same test case
(PR54146 still) but looking only at functions with initially 1000-9999
Compressing live ranges: from 1766 to 705 - 39%
Compressing live ranges: from 1449 to 370 - 25%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 2464 to 676 - 27%
Compressing live ranges: from 1433 to 379 - 26%
Compressing live ranges: from 1261 to 348 - 27%
Compressing live ranges: from 2835 to 755 - 26%
Compressing live ranges: from 5426 to 1678 - 30%
Compressing live ranges: from 5227 to 1477 - 28%
Compressing live ranges: from 1845 to 467 - 25%
Compressing live ranges: from 4868 to 1378 - 28%
Compressing live ranges: from 4875 to 1388 - 28%
Compressing live ranges: from 4882 to 1384 - 28%
Compressing live ranges: from 5688 to 1714 - 30%
Compressing live ranges: from 4943 to 1310 - 26%
Compressing live ranges: from 2976 to 792 - 26%
Compressing live ranges: from 5463 to 1526 - 27%
Compressing live ranges: from 2854 to 730 - 25%
Compressing live ranges: from 1810 to 745 - 41%
Compressing live ranges: from 2771 to 904 - 32%
Compressing live ranges: from 4916 to 1429 - 29%
Compressing live ranges: from 6505 to 2238 - 34%
Compressing live ranges: from 6493 to 166 - 2%
Compressing live ranges: from 5498 to 1734 - 31%
Compressing live ranges: from 1810 to 745 - 41%
Compressing live ranges: from 5043 to 1420 - 28%
Compressing live ranges: from 3094 to 788 - 25%
Compressing live ranges: from 4563 to 1311 - 28%
Compressing live ranges: from 4557 to 158 - 3%
Compressing live ranges: from 1050 to 274 - 26%
Compressing live ranges: from 1602 to 434 - 27%
Compressing live ranges: from 2474 to 600 - 24%
Compressing live ranges: from 2718 to 866 - 31%
Compressing live ranges: from 2097 to 716 - 34%
Compressing live ranges: from 4152 to 1099 - 26%
Compressing live ranges: from 5065 to 1514 - 29%
Compressing live ranges: from 1236 to 359 - 29%
Compressing live ranges: from 1722 to 541 - 31%
Compressing live ranges: from 5186 to 1401 - 27%
Unfortunately the dump doesn't mention how many live ranges could be
merged thanks to the compression.
It'd also be good to understand why the compression ratios are so
small, and consistently around ~30%.
This sentence is not clear to me. 30% means 3 times fewer points. That is
pretty good compression.
Maybe curr_point includes things
it should ignore (DEBUG_INSNs, NOTEs, ...)?
After the compression, only the points important for conflict info remain,
i.e. only points where some reg dies or starts living. Moreover, if
consecutive points contain only births or only deaths, they are
represented by one point after the compression.
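To make that merging rule concrete, here is a minimal sketch, illustrative
only and not the code in lra-lives.c: the per-point birth/death flags and
the MAP array are invented for the example. It drops points where nothing
happens and represents a run of birth-only (or death-only) points by a
single point:

/* Map N_POINTS old program points onto a compressed numbering.
   HAS_BIRTH[p]/HAS_DEATH[p] say whether any live range starts/ends at
   point P; MAP[p] receives the new index of P.  Returns the number of
   points after compression.  */
static int
compress_points (const bool *has_birth, const bool *has_death,
                 int n_points, int *map)
{
  int n_new = 0;
  bool prev_only_births = false, prev_only_deaths = false;

  for (int p = 0; p < n_points; p++)
    {
      bool b = has_birth[p], d = has_death[p];

      if (!b && !d)
        {
          /* Nothing happens at P: it gets no point of its own.  */
          map[p] = n_new > 0 ? n_new - 1 : 0;
          continue;
        }
      if ((b && !d && prev_only_births) || (!b && d && prev_only_deaths))
        {
          /* Extend the current run of birth-only or death-only points.  */
          map[p] = n_new - 1;
          continue;
        }
      map[p] = n_new++;
      prev_only_births = b && !d;
      prev_only_deaths = !b && d;
    }
  return n_new;
}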
With this patch there is less left for the compression pass to do: without
the patch the compression ratio is typically around 30%, with the patch it
goes up to ~65%. The worst compression ratios (where worse is
better in this case :-) are:

$ grep Compressing log4 | egrep " [78].%"
Compressing live ranges: from 31 to 22 - 70%, pre_count 17, post_count 15
Compressing live ranges: from 128 to 94 - 73%, pre_count 87, post_count 77
Compressing live ranges: from 32 to 23 - 71%, pre_count 16, post_count 15
Compressing live ranges: from 38 to 29 - 76%, pre_count 19, post_count 18
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 60 to 43 - 71%, pre_count 37, post_count 33
Compressing live ranges: from 60 to 43 - 71%, pre_count 37, post_count 33
Compressing live ranges: from 125 to 89 - 71%, pre_count 71, post_count 62
Compressing live ranges: from 10 to 8 - 80%, pre_count 6, post_count 6
Compressing live ranges: from 54 to 39 - 72%, pre_count 35, post_count 30
Compressing live ranges: from 60 to 43 - 71%, pre_count 37, post_count 33
Compressing live ranges: from 60 to 43 - 71%, pre_count 37, post_count 33
Compressing live ranges: from 194 to 143 - 73%, pre_count 127, post_count 114
Compressing live ranges: from 2804 to 2047 - 73%, pre_count 2932,
post_count 2841
Compressing live ranges: from 194 to 143 - 73%, pre_count 127, post_count 114
Compressing live ranges: from 2184 to 1533 - 70%, pre_count 1458,
post_count 1366
Compressing live ranges: from 20 to 16 - 80%, pre_count 10, post_count 10

(pre_count and post_count are the number of live ranges before/after
compressing)

The "worst" result is this:
Compressing live ranges: from 726174 to 64496 - 8%, pre_count
40476128, post_count 12483414

But that's still a lot better than before the patch for the same function:
Compressing live ranges: from 1742569 to 73069 - 4%, pre_count
40842330, post_count 12479992

I'm not sure why I end up with fewer program points overall; I had
expected that remove_some_program_points_and_update_live_ranges would
make the post-compression numbers the same for the unpatched and
patched compilers. But oh well... There is no difference for >90% of
the cases, so probably I just happen to trigger a few extra merges that
remove_some_program_points_and_update_live_ranges before the patch
somehow overlooked and kept disjoint.

Unfortunately, and much to my chagrin, there is almost no pay-off on
compile time. This is not completely unexpected, because the most
costly loops in e.g. remove_some_program_points_and_update_live_ranges
are unchanged (loop over all live ranges for all pseudos). Timings
before ("log") and after ("log4") the patch:

$ grep "LRA " log{,4} | egrep -o "^.*usr"
log: LRA non-specific : 60.94 ( 5%) usr
log: LRA virtuals elimination: 60.24 ( 5%) usr
log: LRA reload inheritance : 6.25 ( 1%) usr
log: LRA create live ranges : 181.79 (16%) usr
log: LRA hard reg assignment : 132.33 (11%) usr
log: LRA coalesce pseudo regs: 2.48 ( 0%) usr
log4: LRA non-specific : 60.52 ( 5%) usr
log4: LRA virtuals elimination: 62.44 ( 5%) usr
log4: LRA reload inheritance : 6.45 ( 1%) usr
log4: LRA create live ranges : 177.69 (15%) usr
log4: LRA hard reg assignment : 131.76 (11%) usr
log4: LRA coalesce pseudo regs: 2.60 ( 0%) usr

I was trying to make the
remove_some_program_points_and_update_live_ranges compression step
almost unnecessary (at least for the first iteration of
lra_create_live_ranges) because that would give a ~25% speedup for
"LRA create live ranges" for the PR54146 test case. But even with this
patch there are no functions for which
remove_some_program_points_and_update_live_ranges becomes a no-op :-(

Still, I would like to propose this for the lra-branch. Hopefully it's
possible to find out how to get closer to 100%; there is a potential
win there.

Bootstrapped and tested on x86_64-unknown-linux-gnu.
I also checked and double-checked that there are no code generation
differences on all pre-processed source files of cc1.

OK for lra-branch?

Ciao!
Steven
Vladimir Makarov
2012-10-04 16:12:11 UTC
Permalink
On Thu, Oct 4, 2012 at 5:30 AM, Vladimir Makarov
I was going to look at this code too, but I was interested in generating
fewer points and live ranges. It is strange that in my profiles,
remove_some_program_points_and_update_live_ranges takes 0.6% of compiler
time on these huge tests. So I was not interested in speeding up the
function, and maybe that is why you see no visible change in compilation
time.
Right. The compression algorithm doesn't care much about the initial
number of program points, only about the number of live ranges before
and after compression. I had expected a bigger effect on the number of
live ranges before compression.
0.6% sounds really very different from my timings. How much time does
create_start_finish_chains take for you?
0.65% (2.78s).
Actually, I have a profile but I am not sure now that it is for
PR54146. It might be for PR26854.
I'll check it again to be sure.
No, it looks about the same.
Vladimir Makarov
2012-10-04 17:44:03 UTC
Permalink
Post by Vladimir Makarov
0.6% sounds really very different from my timings. How much time does
create_start_finish_chains take for you?
0.65% (2.78s).
Actually, I have a profile but I am not sure now that it is for PR54146.
It might be for PR26854.
I'll check it again to be sure.
No, it looks about the same.
Well, that's very strange. Maybe we measure these things differently?
I just hijack a timevar, so I measure e.g. the time spent in
create_start_finish_chains like so:
Index: lra-lives.c
===================================================================
--- lra-lives.c (revision 192052)
+++ lra-lives.c (working copy)
@@ -770,6 +812,7 @@ create_start_finish_chains (void)
int i, max_regno;
lra_live_range_t r;
+timevar_push (TV_CPROP);
lra_start_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
lra_finish_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
max_regno = max_reg_num ();
@@ -783,6 +826,7 @@ create_start_finish_chains (void)
lra_finish_point_ranges[r->finish] = r;
}
}
+timevar_pop (TV_CPROP);
}
/* Rebuild LRA_START_POINT_RANGES and LRA_FINISH_POINT_RANGES after
so that I get the timings in the -ftime-report like so:
CPROP : 43.14 ( 4%) usr
integrated RA : 200.81 (17%) usr
LRA non-specific : 62.18 ( 5%) usr
LRA virtuals elimination: 61.71 ( 5%) usr
LRA reload inheritance : 6.41 ( 1%) usr
LRA create live ranges : 139.75 (13%) usr
LRA hard reg assignment : 130.90 (11%) usr
LRA coalesce pseudo regs: 2.45 ( 0%) usr
reload : 9.09 ( 1%) usr
"Crude, but efficient" (tm) :-)
How do you measure the time spent in that function, and in
remove_some_program_points_and_update_live_ranges?
You use AMD and I use Intel, so it may be different from a cache point of
view.

Another thing is that I used gprof (-pg was used for bitmap.o, lra*.o and
ira*.o). Your measurements are more accurate, I think, because they are
taken without instrumentation, and bitmap.o takes too much time. Bitmaps
do not work well in this case because they are too big and sparse.
Vladimir Makarov
2012-10-04 19:19:19 UTC
Permalink
Post by Vladimir Makarov
On Thu, Oct 4, 2012 at 6:12 PM, Vladimir Makarov
CPROP : 43.14 ( 4%) usr
integrated RA : 200.81 (17%) usr
LRA non-specific : 62.18 ( 5%) usr
LRA virtuals elimination: 61.71 ( 5%) usr
LRA reload inheritance : 6.41 ( 1%) usr
LRA create live ranges : 139.75 (13%) usr
LRA hard reg assignment : 130.90 (11%) usr
LRA coalesce pseudo regs: 2.45 ( 0%) usr
reload : 9.09 ( 1%) usr
"Crude, but efficient" (tm) :-)
How do you measure the time spent in that function, and in
remove_some_program_points_and_update_live_ranges?
You use AMD and I use Intel, so it may be different from a cache point
of view.
Another thing is that I used gprof (-pg was used for bitmap.o, lra*.o
and ira*.o). Your measurements are more accurate, I think, because they
are taken without instrumentation, and bitmap.o takes too much time.
Bitmaps do not work well in this case because they are too big and sparse.
Yes, gcc17 (AMD) behaviour is very different from Intel machines. I
think that is why we get such different numbers. create_start_finish_chains
alone takes 2.4% (28s) according to gprof on slow.cc (before my last
patch). Also, on the AMD machine find_hard_regno_for is in first place
(on Intel machines, several bitmap functions are in first place). That
is the function I want to look at more closely later (to implement some
simpler algorithm for huge functions).

I think your patch will become even more important as my last patch
increases the % of time spent in lra-lives.c. So thank you very much,
Steven.

I'd like to play more with your patch, and I'll probably give you approval
to commit tomorrow.
Steven Bosscher
2012-10-04 16:56:09 UTC
Permalink
Post by Vladimir Makarov
0.6% sounds really very different from my timings. How much time does
create_start_finish_chains take for you?
0.65% (2.78s).
Actually, I have a profile but I am not sure now that it is for PR54146.
It might be for PR26854.
I'll check it again to be sure.
No, it looks about the same.
Well, that's very strange. Maybe we measure these things differently?
I just hijack a timevar, so I measure e.g. the time spent in
create_start_finish_chains like so:

Index: lra-lives.c
===================================================================
--- lra-lives.c (revision 192052)
+++ lra-lives.c (working copy)
@@ -770,6 +812,7 @@ create_start_finish_chains (void)
int i, max_regno;
lra_live_range_t r;

+timevar_push (TV_CPROP);
lra_start_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
lra_finish_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
max_regno = max_reg_num ();
@@ -783,6 +826,7 @@ create_start_finish_chains (void)
lra_finish_point_ranges[r->finish] = r;
}
}
+timevar_pop (TV_CPROP);
}

/* Rebuild LRA_START_POINT_RANGES and LRA_FINISH_POINT_RANGES after


so that I get the timings in the -ftime-report like so:

CPROP : 43.14 ( 4%) usr
integrated RA : 200.81 (17%) usr
LRA non-specific : 62.18 ( 5%) usr
LRA virtuals elimination: 61.71 ( 5%) usr
LRA reload inheritance : 6.41 ( 1%) usr
LRA create live ranges : 139.75 (13%) usr
LRA hard reg assignment : 130.90 (11%) usr
LRA coalesce pseudo regs: 2.45 ( 0%) usr
reload : 9.09 ( 1%) usr

"Crude, but efficient" (tm) :-)

How do you measure the time spent in that function, and in
remove_some_program_points_and_update_live_ranges?

Ciao!
Steven
Steven Bosscher
2012-10-04 17:14:13 UTC
Permalink
"Crude, but efficient" (tm) :-)
BTW with a similar approach I also time other bits of process_bb_lives:

timevar_push (TV_HOIST);
/* See if we'll need an increment at the end of this basic block.
An increment is needed if the PSEUDOS_LIVE set is not empty,
to make sure the finish points are set up correctly. */
need_curr_point_incr = (sparseset_cardinality (pseudos_live) > 0);
EXECUTE_IF_SET_IN_SPARSESET (pseudos_live, i)
mark_pseudo_dead (i, curr_point);
timevar_pop (TV_HOIST);
timevar_push (TV_PRE);
EXECUTE_IF_SET_IN_SPARSESET (pseudos_live_through_calls, i)
if (bitmap_bit_p (DF_LR_IN (bb), i))
check_pseudos_live_through_calls (i);
timevar_pop (TV_PRE);

PRE : 12.20 ( 1%) usr
code hoisting : 34.03 ( 3%) usr

IOW that's ~46s out of ~180s *not* spent in walking the insns in
process_bb_lives!

I think this is due to the opening/closing of chains for all pseudos
live through the basic block at the start and end of process_bb_lives.
For this test case the cardinality of pseudos_live is as high as
max_reg_num/2 at its peak.

I tried to keep pseudos_live open for bb->prev_bb if there is a
find_edge(bb->prev_bb,bb), but that's when I ran into a problem with
inconsistent liveness data (that's what this message was about:
http://gcc.gnu.org/ml/gcc/2012-10/msg00035.html).

Ciao!
Steven
Steven Bosscher
2012-10-04 12:59:28 UTC
Permalink
On Thu, Oct 4, 2012 at 1:30 PM, Richard Guenther
Isn't _REVERSE vs. non-_REVERSE still kind-of "random" order?
Not at this stage. For cfglayout mode I would answer yes, but IRA/LRA
operates in cfgrtl mode, so the sequence of insns and basic blocks
must match. Therefore, if you walk the basic blocks in reverse, and
the insns in each basic block in reverse, you effectively work on a,
let's say, "reverse extended basic block" (??) in that your program
points are sequential across fallthrough edges.
Thus, doesn't
the above show there exists an optimal order for processing which we could use?
There may be a smarter order: you could even walk blocks in that order if
you knew a priori which path through the CFG minimizes the length of
the live range chains. But to know that order, you have to build the
chains. So chicken-and-egg...
(I realize _REVERSE is a simple solution, but might there not exist a
pathological case where _REVERSE is even worse than non-_REVERSE?)
Intuitively, I'm certain that _REVERSE is always better than
non-_REVERSE, although I don't know how to prove that :-)

Ciao!
Steven
Steven Bosscher
2012-10-05 21:53:12 UTC
Permalink
Post by Steven Bosscher
On Thu, Oct 4, 2012 at 1:30 PM, Richard Guenther
Isn't _REVERSE vs. non-_REVERSE still kind-of "random" order?
Not at this stage. For cfglayout mode I would answer yes, but IRA/LRA
operates in cfgrtl mode, so the sequence of insns and basic blocks
must match. Therefore, if you walk the basic blocks in reverse, and
the insns in each basic block in reverse, you effectively work on a,
let's say, "reverse extended basic block" (??) in that your program
points are sequential across fallthrough edges.
Thus, doesn't
the above show there exists an optimal order for processing which we could use?
There may be a smarter order: Could even walk blocks in that order if
you know a priori what path through the CFG minimizes the length of
the live range chains. But to know that order, you have to build the
chains. So chicken-and-egg...
Richi, you inspired me, and I remembered your (still disgusting
;-)) my_rev_post_order_compute hack.

A better order than walking basic blocks in reverse is to walk them
in topological order on the reverse CFG. This results in very long
live ranges being compressed into very short chains because the
program points become almost sequential for them.

So here's one more patch:

@@ -994,8 +1092,16 @@ lra_create_live_ranges (bool all_p)
curr_point = 0;
point_freq_vec = VEC_alloc (int, heap, get_max_uid () * 2);
lra_point_freq = VEC_address (int, point_freq_vec);
- FOR_EACH_BB (bb)
- process_bb_lives (bb);
+ int *post_order_rev_cfg = XNEWVEC (int, last_basic_block);
+ int n_blocks_inverted = inverted_post_order_compute (post_order_rev_cfg);
+ lra_assert (n_blocks_inverted == n_basic_blocks);
+ for (i = n_blocks_inverted - 1; i >= 0; --i)
+ {
+ bb = BASIC_BLOCK (post_order_rev_cfg[i]);
+ if (bb == EXIT_BLOCK_PTR || bb == ENTRY_BLOCK_PTR)
+ continue;
+ process_bb_lives (bb, curr_point);
+ } // HACK13 must add free (post_order_rev_cfg); here
lra_live_max_point = curr_point;
create_start_finish_chains ();
if (lra_dump_file != NULL)
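For reference, this is the same loop with the cleanup the HACK13 comment
asks for added; no other change, and it is just the tidied form of the
hunk above, not a separately tested patch:

  int *post_order_rev_cfg = XNEWVEC (int, last_basic_block);
  int n_blocks_inverted = inverted_post_order_compute (post_order_rev_cfg);
  lra_assert (n_blocks_inverted == n_basic_blocks);
  for (i = n_blocks_inverted - 1; i >= 0; --i)
    {
      bb = BASIC_BLOCK (post_order_rev_cfg[i]);
      if (bb == EXIT_BLOCK_PTR || bb == ENTRY_BLOCK_PTR)
        continue;
      process_bb_lives (bb, curr_point);
    }
  free (post_order_rev_cfg);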


Without this patch:
Compressing live ranges: from 700458 to 391665 - 55%, pre_count
40730653, post_count 34363983
max per-reg pre_count 12978 (228090, 2 defs, 2 uses) (reg/f:DI 228090
[ SR.25009 ])
max per-reg post_count 10967 (228090, 2 defs, 2 uses) (reg/f:DI 228090
[ SR.25009 ])

With this patch:
Compressing live ranges: from 700458 to 372585 - 53%, pre_count
283937, post_count 271120
max per-reg pre_count 545 (230653, 542 defs, 542 uses) (reg/f:DI
230653 [ SR.13303 ])
max per-reg post_count 544 (230649, 542 defs, 542 uses) (reg/f:DI
230649 [ SR.13305 ])

(the per-reg counts are the lengths of the live range chains for the
mentioned regno).

So the patch results in fewer program points, and way, waaaaay shorter
live range chains. Needless to say, really, compile time benefits
significantly also. Time spent in create_start_finish_chains:
Without this patch, 19.75s (2%) usr
With this patch, 0.14s (0%)

Vlad, is this something you also already tried? If not, then I'm going
to bootstrap&test this (on top of my pending other patches on
lra-lives.c).

Ciao!
Steven
Vladimir Makarov
2012-10-06 02:52:13 UTC
Permalink
Post by Steven Bosscher
Post by Steven Bosscher
On Thu, Oct 4, 2012 at 1:30 PM, Richard Guenther
Isn't _REVERSE vs. non-_REVERSE still kind-of "random" order?
Not at this stage. For cfglayout mode I would answer yes, but IRA/LRA
operates in cfgrtl mode, so the sequence of insns and basic blocks
must match. Therefore, if you walk the basic blocks in reverse, and
the insns in each basic block in reverse, you effectively work on a,
let's say, "reverse extended basic block" (??) in that your program
points are sequential across fallthrough edges.
Thus, doesn't
the above show there exists an optimal order for processing which we could use?
There may be a smarter order: Could even walk blocks in that order if
you know a priori what path through the CFG minimizes the length of
the live range chains. But to know that order, you have to build the
chains. So chicken-and-egg...
Richi, you inspired me, and I remembered your (still disgusting
;-)) my_rev_post_order_compute hack.
A better order than walking basic blocks in reverse is to walk them
in topological order on the reverse CFG. This results in very long
live ranges being compressed into very short chains because the
program points become almost sequential for them.
@@ -994,8 +1092,16 @@ lra_create_live_ranges (bool all_p)
curr_point = 0;
point_freq_vec = VEC_alloc (int, heap, get_max_uid () * 2);
lra_point_freq = VEC_address (int, point_freq_vec);
- FOR_EACH_BB (bb)
- process_bb_lives (bb);
+ int *post_order_rev_cfg = XNEWVEC (int, last_basic_block);
+ int n_blocks_inverted = inverted_post_order_compute (post_order_rev_cfg);
+ lra_assert (n_blocks_inverted == n_basic_blocks);
+ for (i = n_blocks_inverted - 1; i >= 0; --i)
+ {
+ bb = BASIC_BLOCK (post_order_rev_cfg[i]);
+ if (bb == EXIT_BLOCK_PTR || bb == ENTRY_BLOCK_PTR)
+ continue;
+ process_bb_lives (bb, curr_point);
+ } // HACK13 must add free (post_order_rev_cfg); here
lra_live_max_point = curr_point;
create_start_finish_chains ();
if (lra_dump_file != NULL)
Compressing live ranges: from 700458 to 391665 - 55%, pre_count
40730653, post_count 34363983
max per-reg pre_count 12978 (228090, 2 defs, 2 uses) (reg/f:DI 228090
[ SR.25009 ])
max per-reg post_count 10967 (228090, 2 defs, 2 uses) (reg/f:DI 228090
[ SR.25009 ])
Compressing live ranges: from 700458 to 372585 - 53%, pre_count
283937, post_count 271120
max per-reg pre_count 545 (230653, 542 defs, 542 uses) (reg/f:DI
230653 [ SR.13303 ])
max per-reg post_count 544 (230649, 542 defs, 542 uses) (reg/f:DI
230649 [ SR.13305 ])
(the per-reg counts are the lengths of the live range chains for the
mentioned regno).
Yes, that is impressive. But I think the number of points in a live range
is the real parameter of the complexity.
Post by Steven Bosscher
So the patch results in fewer program points, and way, waaaaay shorter
live range chains. Needless to say, really, compile time benefits
significantly also. Time spent in create_start_finish_chains:
Without this patch, 19.75s (2%) usr
With this patch, 0.14s (0%)
Vlad, is this something you also already tried? If not, then I'm going
to bootstrap&test this (on top of my pending other patches on
lra-lives.c).
No, I was busy with fixing a bug in LRA. I played a bit with your patch
(including FOR_EACH_BB_REVERSE) on an Intel machine, but I did not see a
visible improvement on slow.cc.

The code size is a bit different too, which I cannot explain, because
neither the number of points nor the number of elements in the live range
lists enters into the assignment decisions. I have no time to find out why
it is different. In any case I think it is OK and there is a reasonable
explanation for it.

As for AMD machines, the improvement is really significant. I have no
objections to the previous patch (including FOR_EACH_BB_REVERSE) or the
patch you posted here, provided of course that it is committed with a
proper ChangeLog.

Thanks for working on this, Steven. I'd like to attack
find_hard_regno_for when I have time. I should say that working on
improving compilation speed for this particular case is a bit addictive
for me.
Steven Bosscher
2012-10-04 16:45:32 UTC
Permalink
Wow. I did not see such an effect. What machine do you use?
I do all my testing on gcc17.

Ciao!
Steven
Vladimir Makarov
2012-10-04 03:30:10 UTC
Permalink
Post by Steven Bosscher
On Tue, Oct 2, 2012 at 3:42 PM, Richard Sandiford
+/* Compress pseudo live ranges by removing program points where
+ nothing happens. Complexity of many algorithms in LRA is linear
+ function of program points number. To speed up the code we try to
+ minimize the number of the program points here. */
+static void
+remove_some_program_points_and_update_live_ranges (void)
Genuine question, but could we do this on the fly instead,
by not incrementing curr_point if the current point had no value?
I suppose the main complication would be checking cases where
all births are recorded by extending the previous just-closed live
range rather than starting a new one, in which case it's the previous
point that needs to be reused. Hmm...
Compressing live ranges: from 1742579 to 554532 - 31%
Compressing live ranges: from 1742569 to 73069 - 4%
look extreme, but they're actually the norm. For the same test case
(PR54146 still) but looking only at functions with initially 1000-9999
Compressing live ranges: from 1766 to 705 - 39%
Compressing live ranges: from 1449 to 370 - 25%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 3939 to 1093 - 27%
Compressing live ranges: from 2464 to 676 - 27%
Compressing live ranges: from 1433 to 379 - 26%
Compressing live ranges: from 1261 to 348 - 27%
Compressing live ranges: from 2835 to 755 - 26%
Compressing live ranges: from 5426 to 1678 - 30%
Compressing live ranges: from 5227 to 1477 - 28%
Compressing live ranges: from 1845 to 467 - 25%
Compressing live ranges: from 4868 to 1378 - 28%
Compressing live ranges: from 4875 to 1388 - 28%
Compressing live ranges: from 4882 to 1384 - 28%
Compressing live ranges: from 5688 to 1714 - 30%
Compressing live ranges: from 4943 to 1310 - 26%
Compressing live ranges: from 2976 to 792 - 26%
Compressing live ranges: from 5463 to 1526 - 27%
Compressing live ranges: from 2854 to 730 - 25%
Compressing live ranges: from 1810 to 745 - 41%
Compressing live ranges: from 2771 to 904 - 32%
Compressing live ranges: from 4916 to 1429 - 29%
Compressing live ranges: from 6505 to 2238 - 34%
Compressing live ranges: from 6493 to 166 - 2%
Compressing live ranges: from 5498 to 1734 - 31%
Compressing live ranges: from 1810 to 745 - 41%
Compressing live ranges: from 5043 to 1420 - 28%
Compressing live ranges: from 3094 to 788 - 25%
Compressing live ranges: from 4563 to 1311 - 28%
Compressing live ranges: from 4557 to 158 - 3%
Compressing live ranges: from 1050 to 274 - 26%
Compressing live ranges: from 1602 to 434 - 27%
Compressing live ranges: from 2474 to 600 - 24%
Compressing live ranges: from 2718 to 866 - 31%
Compressing live ranges: from 2097 to 716 - 34%
Compressing live ranges: from 4152 to 1099 - 26%
Compressing live ranges: from 5065 to 1514 - 29%
Compressing live ranges: from 1236 to 359 - 29%
Compressing live ranges: from 1722 to 541 - 31%
Compressing live ranges: from 5186 to 1401 - 27%
Unfortunately the dump doesn't mention how many live ranges could be
merged thanks to the compression.
It'd also be good to understand why the compression ratios are so
small, and consistently around ~30%.
This sentence is not clear to me. 30% means 3 times fewer points. That is
pretty good compression.
Maybe curr_point includes things
it should ignore (DEBUG_INSNs, NOTEs, ...)?
After the compression, only the points important for conflict info remain,
i.e. only points where some reg dies or starts living. Moreover, if
consecutive points contain only births or only deaths, they are
represented by one point after the compression.
With this patch there is less left for the compression pass to do: without
the patch the compression ratio is typically around 30%, with the patch it
goes up to ~65%. The worst compression ratios (where worse is better in
this case :-) are:
$ grep Compressing log4 | egrep " [78].%"
Compressing live ranges: from 31 to 22 - 70%, pre_count 17, post_count 15
Compressing live ranges: from 128 to 94 - 73%, pre_count 87, post_count 77
Compressing live ranges: from 32 to 23 - 71%, pre_count 16, post_count 15
Compressing live ranges: from 38 to 29 - 76%, pre_count 19, post_count 18
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 40 to 28 - 70%, pre_count 26, post_count 24
Compressing live ranges: from 60 to 43 - 71%, pre_count 37, post_count 33
Compressing live ranges: from 60 to 43 - 71%, pre_count 37, post_count 33
Compressing live ranges: from 125 to 89 - 71%, pre_count 71, post_count 62
Compressing live ranges: from 10 to 8 - 80%, pre_count 6, post_count 6
Compressing live ranges: from 54 to 39 - 72%, pre_count 35, post_count 30
Compressing live ranges: from 60 to 43 - 71%, pre_count 37, post_count 33
Compressing live ranges: from 60 to 43 - 71%, pre_count 37, post_count 33
Compressing live ranges: from 194 to 143 - 73%, pre_count 127, post_count 114
Compressing live ranges: from 2804 to 2047 - 73%, pre_count 2932,
post_count 2841
Compressing live ranges: from 194 to 143 - 73%, pre_count 127, post_count 114
Compressing live ranges: from 2184 to 1533 - 70%, pre_count 1458,
post_count 1366
Compressing live ranges: from 20 to 16 - 80%, pre_count 10, post_count 10
(pre_count and post_count are the number of live ranges before/after
compressing)
Compressing live ranges: from 726174 to 64496 - 8%, pre_count
40476128, post_count 12483414
Compressing live ranges: from 1742569 to 73069 - 4%, pre_count
40842330, post_count 12479992
I'm not sure why I end up with fewer program points overall; I had
expected that remove_some_program_points_and_update_live_ranges would
make the post-compression numbers the same for the unpatched and
patched compilers. But oh well... There is no difference for >90% of
the cases, so probably I just happen to trigger a few extra merges that
remove_some_program_points_and_update_live_ranges before the patch
somehow overlooked and kept disjoint.
Unfortunately, and much to my chagrin, there is almost no pay-off on
compile time. This is not completely unexpected, because the most
costly loops in e.g. remove_some_program_points_and_update_live_ranges
are unchanged (loop over all live ranges for all pseudos). Timings
$ grep "LRA " log{,4} | egrep -o "^.*usr"
log: LRA non-specific : 60.94 ( 5%) usr
log: LRA virtuals elimination: 60.24 ( 5%) usr
log: LRA reload inheritance : 6.25 ( 1%) usr
log: LRA create live ranges : 181.79 (16%) usr
log: LRA hard reg assignment : 132.33 (11%) usr
log: LRA coalesce pseudo regs: 2.48 ( 0%) usr
log4: LRA non-specific : 60.52 ( 5%) usr
log4: LRA virtuals elimination: 62.44 ( 5%) usr
log4: LRA reload inheritance : 6.45 ( 1%) usr
log4: LRA create live ranges : 177.69 (15%) usr
log4: LRA hard reg assignment : 131.76 (11%) usr
log4: LRA coalesce pseudo regs: 2.60 ( 0%) usr
I was trying to make the
remove_some_program_points_and_update_live_ranges compression step
almost unnecessary (at least for the first iteration of
lra_create_live_ranges) because that would give a ~25% speedup for
"LRA create live ranges" for the PR54146 test case. But even with this
patch there are no functions for which
remove_some_program_points_and_update_live_ranges becomes a no-op :-(
Still, I would like to propose this for the lra-branch. Hopefully it's
possible to find out how to get closer to 100%; there is a potential
win there.
Bootstrapped and tested on x86_64-unknown-linux-gnu.
I also checked and double-checked that there are no code generation
differences on all pre-processed source files of cc1.
OK for lra-branch?
I was going to look at this code too, but I was interested in generating
fewer points and live ranges. It is strange that in my profiles,
remove_some_program_points_and_update_live_ranges takes 0.6% of compiler
time on these huge tests. So I was not interested in speeding up the
function, and maybe that is why you see no visible change in compilation
time.

I don't object to the idea of the patch. I need some time to look at it (the different results on a function are a bit scary for me) and to check simulator times on other tests.

Thanks for working on this issue, Steven.
Vladimir Makarov
2012-10-04 15:30:14 UTC
Permalink
I was going to look at this code too, but I was interested in generating
fewer points and live ranges. It is strange that in my profiles,
remove_some_program_points_and_update_live_ranges takes 0.6% of compiler
time on these huge tests. So I was not interested in speeding up the
function, and maybe that is why you see no visible change in compilation
time.
Right. The compression algorithm doesn't care much about the initial
number of program points, only about the number of live ranges before
and after compression. I had expected a bigger effect on the number of
live ranges before compression.
0.6% sounds really very different from my timings. How much time does
create_start_finish_chains take for you?
I don't object to the idea of the patch. I need some time to look at it (the
different results on a function are a bit scary for me) and to check
simulator times on other tests.
Understood.
@@ -994,8 +1044,8 @@ lra_create_live_ranges (bool all_p)
curr_point = 0;
point_freq_vec = VEC_alloc (int, heap, get_max_uid () * 2);
lra_point_freq = VEC_address (int, point_freq_vec);
- FOR_EACH_BB (bb)
- process_bb_lives (bb);
+ FOR_EACH_BB_REVERSE (bb)
+ process_bb_lives (bb, curr_point);
lra_live_max_point = curr_point;
create_start_finish_chains ();
if (lra_dump_file != NULL)
I think this should result in more live ranges being merged. Here's
why I think so, based on my far worse understanding of this code than
yours, so forgive me if I'm Completely Wrong :-)
No, you are not wrong.
Two days ago I worked on a patch which contains the same code. The patch
actually takes EBBs into account to decrease the number of calls to
mark_pseudo_live at the beginning of process_bb_lives and to
mark_pseudo_dead at its end, and for that I needed FOR_EACH_BB_REVERSE.
The patch was half baked (it did not check hard-reg live changes at the
end of a BB to set up the right hard-reg conflicts for pseudos) but it
gave an idea of how much I can get from this. It is not bad, but not what
I expected, so I stopped working on it. But we should still work on these
ideas, as they improve LRA speed in small steps (many small steps will
create a visible effect).

We can really solve the scalability problem only by using simpler but
still good enough algorithms (algorithms that are too simple result in
bigger code size and actually even in worse compilation times). I've been
working on it and I'll send a patch soon.
process_bb_lives walks insns in the basic block from last to first, so
say you have a basic block chain 1->2->3, and each block has 4 insns,
then AFAIU the program points in block 1 will be [4,3,2,1], in block 2
it will be [8,7,6,5], and in block 3 it will be [12,11,10,9]. Say a
reg is used in block 3 at point 11, and set in block 1 at point 3. Then
this reg will have a live range chain [3-1],[8-5],[12-11].
If you visit the basic blocks in reverse order, the program points
will be: 1:[12,11,10,9], 2:[8,7,6,5], 3:[4,3,2,1]. Now the same reg
will be set at point 11 and used at point 3, and the live range chain
will be just [11-3].
Steven Bosscher
2012-10-04 06:57:26 UTC
Permalink
I was going to look at this code too, but I was interested in generating
fewer points and live ranges. It is strange that in my profiles,
remove_some_program_points_and_update_live_ranges takes 0.6% of compiler
time on these huge tests. So I was not interested in speeding up the
function, and maybe that is why you see no visible change in compilation
time.
Right. The compression algorithm doesn't care much about the initial
number of program points, only about the number of live ranges before
and after compression. I had expected a bigger effect on the number of
live ranges before compression.

0.6% sounds really very different from my timings. How much time does
create_start_finish_chains take for you?
I don't object to the idea of the patch. I need some time to look at it (the
different results on a function are a bit scary for me) and to check
simulator times on other tests.
Understood.

FWIW you can find the test results from before and after the patch here:
http://gcc.gnu.org/ml/gcc-testresults/2012-10/msg00267.html
http://gcc.gnu.org/ml/gcc-testresults/2012-10/msg00303.html

Ciao!
Steven
Vladimir Makarov
2012-10-04 15:45:51 UTC
Permalink
I was going to look at this code too, but I was interested in generating
fewer points and live ranges. It is strange that in my profiles,
remove_some_program_points_and_update_live_ranges takes 0.6% of compiler
time on these huge tests. So I was not interested in speeding up the
function, and maybe that is why you see no visible change in compilation
time.
Right. The compression algorithm doesn't care much about the initial
number of program points, only about the number of live ranges before
and after compression. I had expected a bigger effect on the number of
live ranges before compression.
0.6% sounds really very different from my timings. How much time does
create_start_finish_chains take for you?
0.65% (2.78s).

Actually, I have a profile but I am not sure now that it is for
PR54146. It might be for PR26854.

I'll check it again to be sure.
Steven Bosscher
2012-10-04 07:24:04 UTC
Permalink
I was going to look at this code too, but I was interested in generating
fewer points and live ranges. It is strange that in my profiles,
remove_some_program_points_and_update_live_ranges takes 0.6% of compiler
time on these huge tests. So I was not interested in speeding up the
function, and maybe that is why you see no visible change in compilation
time.
Right. The compression algorithm doesn't care much about the initial
number of program points, only about the number of live ranges before
and after compression. I had expected a bigger effect on the number of
live ranges before compression.
0.6% sounds really very different from my timings. How much time does
create_start_finish_chains take for you?
I don't object to the idea of the patch. I need some time to look at it (the
different results on a function are a bit scary for me) and to check
simulator times on other tests.
Understood.
BTW, it would be great if you can also look at this additional patch hunk:

@@ -994,8 +1044,8 @@ lra_create_live_ranges (bool all_p)
curr_point = 0;
point_freq_vec = VEC_alloc (int, heap, get_max_uid () * 2);
lra_point_freq = VEC_address (int, point_freq_vec);
- FOR_EACH_BB (bb)
- process_bb_lives (bb);
+ FOR_EACH_BB_REVERSE (bb)
+ process_bb_lives (bb, curr_point);
lra_live_max_point = curr_point;
create_start_finish_chains ();
if (lra_dump_file != NULL)

I think this should result in more live ranges being merged. Here's
why I think so, based on my far worse understanding of this code than
yours, so forgive me if I'm Completely Wrong :-)

process_bb_lives walks insns in the basic block from last to first, so
say you have a basic block chain 1->2->3, and each block has 4 insns,
then AFAIU the program points in block 1 will be [4,3,2,1], in block 2
it will be [8,7,6,5], and in block 3 it will be [12,11,10,9]. Say a
reg is used in block 3 at point 11, and set in block 1 at point 3. Then
this reg will have a live range chain [3-1],[8-5],[12-11].

If you visit the basic blocks in reverse order, the program points
will be: 1:[12,11,10,9], 2:[8,7,6,5], 3:[4,3,2,1]. Now the same reg
will be set at point 11 and used at point 3, and the live range chain
will be just [11-3].
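Laid out side by side, with the same numbers as in the example above:

  FOR_EACH_BB (blocks 1,2,3):         points 1:[4,3,2,1] 2:[8,7,6,5] 3:[12,11,10,9]
    set at point 3, used at point 11  -> chain [3-1],[8-5],[12-11]  (three ranges)
  FOR_EACH_BB_REVERSE (blocks 3,2,1): points 1:[12,11,10,9] 2:[8,7,6,5] 3:[4,3,2,1]
    set at point 11, used at point 3  -> chain [11-3]               (one range)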

I'm experimenting with this extra hunk and will report back here.

Ciao!
Steven
Steven Bosscher
2012-10-04 09:43:02 UTC
Permalink
Compressing live ranges: from 726174 to 64496 - 8%, pre_count 40476128, post_count 12483414
Compressing live ranges: from 1742569 to 73069 - 4%, pre_count 40842330, post_count 12479992
Walking basic blocks with FOR_EACH_BB_REVERSE gives:

Only FOR_EACH_BB_REVERSE:
Compressing live ranges: from 1742579 to 429746 - 24% pre_count
41106212, post_count 34376494
Compressing live ranges: from 1742569 to 63000 - 3% pre_count
40835340, post_count 11055747

FOR_EACH_BB_REVERSE + need_curr_point_incr:
Compressing live ranges: from 726184 to 416529 - 57% pre_count
40743516, post_count 34376846
Compressing live ranges: from 726174 to 61840 - 8% pre_count 40472806,
post_count 11055747

The combination of the two changes takes ~20s off the ~180s for "LRA
create live ranges".

Ciao!
Steven
Vladimir Makarov
2012-10-04 15:31:55 UTC
Permalink
Post by Steven Bosscher
Compressing live ranges: from 726174 to 64496 - 8%, pre_count 40476128, post_count 12483414
Compressing live ranges: from 1742569 to 73069 - 4%, pre_count 40842330, post_count 12479992
Compressing live ranges: from 1742579 to 429746 - 24% pre_count
41106212, post_count 34376494
Compressing live ranges: from 1742569 to 63000 - 3% pre_count
40835340, post_count 11055747
Compressing live ranges: from 726184 to 416529 - 57% pre_count
40743516, post_count 34376846
Compressing live ranges: from 726174 to 61840 - 8% pre_count 40472806,
post_count 11055747
The combination of the two changes takes ~20s off the ~180s for "LRA
create live ranges".
Wow. I did not see such an effect. What machine do you use?
Richard Guenther
2012-10-04 11:30:22 UTC
Permalink
Post by Steven Bosscher
Compressing live ranges: from 726174 to 64496 - 8%, pre_count 40476128, post_count 12483414
Compressing live ranges: from 1742569 to 73069 - 4%, pre_count 40842330, post_count 12479992
Compressing live ranges: from 1742579 to 429746 - 24% pre_count
41106212, post_count 34376494
Compressing live ranges: from 1742569 to 63000 - 3% pre_count
40835340, post_count 11055747
Compressing live ranges: from 726184 to 416529 - 57% pre_count
40743516, post_count 34376846
Compressing live ranges: from 726174 to 61840 - 8% pre_count 40472806,
post_count 11055747
The combination of the two changes takes ~20s off the ~180s for "LRA
create live ranges".
Isn't _REVERSE vs. non-_REVERSE still kind-of "random" order? Thus, doesn't
the above show there exists an optimal order for processing which we could use?
(I realize _REVERSE is a simple solution, but might there not exist a
pathological
case where _REVERSE is even worse than non-_REVERSE?)

Richard.
Post by Steven Bosscher
Ciao!
Steven