Discussion:
[PATCH] AutoFDO patch for trunk
Dehao Chen
2014-05-07 21:24:06 UTC
Permalink
Hi,

I'm planning to port the AutoFDO patch upstream. Attached is the
prepared patch. You can also find the patch in
http://codereview.appspot.com/99010043

I've tested the patch with SPECCPU2006. For the CINT2006 benchmarks,
the speedup comparison between O2, FDO and AutoFDO is as follows:

Reference: o2
(1): auto_fdo
(2): fdo

Benchmark Base:Reference (1) (2)
-----------------------------------------------------------------
spec/2006/int/C++/471.omnetpp 23.18 +3.11% +5.09%
spec/2006/int/C++/473.astar 21.15 +6.79% +9.80%
spec/2006/int/C++/483.xalancbmk 36.68 +11.56% +14.47%
spec/2006/int/C/400.perlbench 34.57 +6.59% +18.56%
spec/2006/int/C/401.bzip2 23.17 +0.95% +2.49%
spec/2006/int/C/403.gcc 32.33 +8.27% +9.76%
spec/2006/int/C/429.mcf 42.13 +4.72% +5.23%
spec/2006/int/C/445.gobmk 26.53 -1.39% +0.05%
spec/2006/int/C/456.hmmer 23.72 +7.12% +7.87%
spec/2006/int/C/458.sjeng 26.17 +4.65% +6.04%
spec/2006/int/C/462.libquantum 57.23 +4.04% +1.42%
spec/2006/int/C/464.h264ref 46.3 +1.07% +8.97%

geometric mean +4.73% +7.36%

The majority of the performance difference between AutoFDO and FDO
comes from the lack of instruction level discriminator support. Cary
Coutant is planning to port that patch upstream too.

Please let me know if you have any question about this patch, and
thanks in advance for reviewing such a huge patch.

Dehao
Xinliang David Li
2014-05-07 21:47:43 UTC
Permalink
Have you announced the autofdo profile tool to gcc list?

David
Post by Dehao Chen
Hi,
I'm planning to port the AutoFDO patch upstream. Attached is the
prepared patch. You can also find the patch in
http://codereview.appspot.com/99010043
I've tested the patch with SPECCPU2006. For the CINT2006 benchmarks,
Reference: o2
(1): auto_fdo
(2): fdo
Benchmark Base:Reference (1) (2)
-----------------------------------------------------------------
spec/2006/int/C++/471.omnetpp 23.18 +3.11% +5.09%
spec/2006/int/C++/473.astar 21.15 +6.79% +9.80%
spec/2006/int/C++/483.xalancbmk 36.68 +11.56% +14.47%
spec/2006/int/C/400.perlbench 34.57 +6.59% +18.56%
spec/2006/int/C/401.bzip2 23.17 +0.95% +2.49%
spec/2006/int/C/403.gcc 32.33 +8.27% +9.76%
spec/2006/int/C/429.mcf 42.13 +4.72% +5.23%
spec/2006/int/C/445.gobmk 26.53 -1.39% +0.05%
spec/2006/int/C/456.hmmer 23.72 +7.12% +7.87%
spec/2006/int/C/458.sjeng 26.17 +4.65% +6.04%
spec/2006/int/C/462.libquantum 57.23 +4.04% +1.42%
spec/2006/int/C/464.h264ref 46.3 +1.07% +8.97%
geometric mean +4.73% +7.36%
The majority of the performance difference between AutoFDO and FDO
comes from the lack of instruction level discriminator support. Cary
Coutant is planning to port that patch upstream too.
Please let me know if you have any question about this patch, and
thanks in advance for reviewing such a huge patch.
Dehao
Jan Hubicka
2014-05-15 21:04:30 UTC
Permalink
Post by Dehao Chen
Hi,
I'm planning to port the AutoFDO patch upstream. Attached is the
prepared patch. You can also find the patch in
http://codereview.appspot.com/99010043
I've tested the patch with SPECCPU2006. For the CINT2006 benchmarks,
Reference: o2
(1): auto_fdo
(2): fdo
Benchmark Base:Reference (1) (2)
-----------------------------------------------------------------
spec/2006/int/C++/471.omnetpp 23.18 +3.11% +5.09%
spec/2006/int/C++/473.astar 21.15 +6.79% +9.80%
spec/2006/int/C++/483.xalancbmk 36.68 +11.56% +14.47%
spec/2006/int/C/400.perlbench 34.57 +6.59% +18.56%
spec/2006/int/C/401.bzip2 23.17 +0.95% +2.49%
spec/2006/int/C/403.gcc 32.33 +8.27% +9.76%
spec/2006/int/C/429.mcf 42.13 +4.72% +5.23%
spec/2006/int/C/445.gobmk 26.53 -1.39% +0.05%
spec/2006/int/C/456.hmmer 23.72 +7.12% +7.87%
spec/2006/int/C/458.sjeng 26.17 +4.65% +6.04%
spec/2006/int/C/462.libquantum 57.23 +4.04% +1.42%
spec/2006/int/C/464.h264ref 46.3 +1.07% +8.97%
geometric mean +4.73% +7.36%
The results are very nice indeed. So basically it adds another 50% to
static estimation, that is not bad.

The patch is long and missing a changelog, I will add my comments and
I think it would be great to break it up where possible - for some
infrastructure preparation patches and the actual reading infrastructure.
Post by Dehao Chen
Index: gcc/cfghooks.c
===================================================================
--- gcc/cfghooks.c (revision 210180)
+++ gcc/cfghooks.c (working copy)
@@ -833,6 +833,9 @@ make_forwarder_block (basic_block bb, bool (*redir
fallthru = split_block_after_labels (bb);
dummy = fallthru->src;
+ dummy->count = 0;
+ dummy->frequency = 0;
+ fallthru->count = 0;
bb = fallthru->dest;
/* Redirect back edges we want to keep. */
@@ -842,20 +845,13 @@ make_forwarder_block (basic_block bb, bool (*redir
if (redirect_edge_p (e))
{
+ dummy->frequency += EDGE_FREQUENCY (e);
+ dummy->count += e->count;
+ fallthru->count += e->count;
ei_next (&ei);
continue;
}
- dummy->frequency -= EDGE_FREQUENCY (e);
- dummy->count -= e->count;
- if (dummy->frequency < 0)
- dummy->frequency = 0;
- if (dummy->count < 0)
- dummy->count = 0;
- fallthru->count -= e->count;
- if (fallthru->count < 0)
- fallthru->count = 0;
-
Can you elaborate why these changes are needed? They look unrelated,
so it is a bug in forwarder block profile updating?
I do not see why source of the edge needs to have profile cleaned.
Post by Dehao Chen
Index: gcc/loop-unroll.c
===================================================================
--- gcc/loop-unroll.c (revision 210180)
+++ gcc/loop-unroll.c (working copy)
@@ -998,6 +998,13 @@ decide_unroll_runtime_iterations (struct loop *loo
return;
}
+ /* In AutoFDO, the profile is not accurate. If the calculated trip count
+ is larger than the header count, then the profile is not accurate
+ enough to make correct unroll decisions. */
+ if (flag_auto_profile
+ && expected_loop_iterations (loop) > loop->header->count)
+ return;
This is another independent change. It does make sense, but I would
preffer to have it in a separate function (expected_loop_iterations_reliable_p)
rather than inline in the code.

There are other loop optimizations that are driven by this, and they should
consistently use this predicate.

In fact current -fprofile-use enabled -funroll-loops flag that enables all
unrolling independently of reliablility of the profile. I suppose we need to
imporove this scheme and have something liek unroll-loop-with-reliable-profile
path that would default for profile feedback optimization.

I also do not think your check makes a lot of sense. You want to compare loop
header count and loop preheader or so, or this is not reliable enough for
auto-FDO profiles?

In any case, lets handle this separately of the main autoFDO patches (and not
forget on it).
Post by Dehao Chen
Index: gcc/dwarf2out.c
===================================================================
--- gcc/dwarf2out.c (revision 210180)
+++ gcc/dwarf2out.c (working copy)
@@ -2481,6 +2481,40 @@ const struct gcc_debug_hooks dwarf2_debug_hooks =
1, /* start_end_main_source_file */
TYPE_SYMTAB_IS_DIE /* tree_type_symtab_field */
};
+
+const struct gcc_debug_hooks auto_profile_debug_hooks =
+{
+ debug_nothing_charstar,
+ debug_nothing_charstar,
+ debug_nothing_void,
+ debug_nothing_int_charstar,
+ debug_nothing_int_charstar,
+ debug_nothing_int_charstar,
+ debug_nothing_int,
+ debug_nothing_int_int, /* begin_block */
+ debug_nothing_int_int, /* end_block */
+ dwarf2out_ignore_block, /* ignore_block */
+ debug_nothing_int_charstar_int_bool, /* source_line */
+ debug_nothing_int_charstar, /* begin_prologue */
+ debug_nothing_int_charstar, /* end_prologue */
+ debug_nothing_int_charstar, /* begin_epilogue */
+ debug_nothing_int_charstar, /* end_epilogue */
+ debug_nothing_tree, /* begin_function */
+ debug_nothing_int, /* end_function */
+ debug_nothing_tree, /* function_decl */
+ debug_nothing_tree, /* global_decl */
+ debug_nothing_tree_int, /* type_decl */
+ debug_nothing_tree_tree_tree_bool, /* imported_module_or_decl */
+ debug_nothing_tree, /* deferred_inline_function */
+ debug_nothing_tree, /* outlining_inline_function */
+ debug_nothing_rtx, /* label */
+ debug_nothing_int, /* handle_pch */
+ debug_nothing_rtx, /* var_location */
+ debug_nothing_void, /* switch_text_section */
+ debug_nothing_tree_tree, /* set_name */
+ 0, /* start_end_main_source_file */
+ TYPE_SYMTAB_IS_ADDRESS /* tree_type_symtab_field */
+};
Note that I can not approve changes to debug machine (and do not feel confident
in it), so lets do this as independent change, too.
You need those to reliably output line info with autofdo enabled or why this
is needed at all?
Post by Dehao Chen
/* NOTE: In the comments in this file, many references are made to
"Debugging Information Entries". This term is abbreviated as `DIE'
@@ -16799,10 +16833,9 @@ add_src_coords_attributes (dw_die_ref die, tree de
static void
add_linkage_name (dw_die_ref die, tree decl)
{
- if (debug_info_level > DINFO_LEVEL_TERSE
+ if (debug_info_level > DINFO_LEVEL_NONE
&& (TREE_CODE (decl) == FUNCTION_DECL || TREE_CODE (decl) == VAR_DECL)
&& TREE_PUBLIC (decl)
- && !DECL_ABSTRACT (decl)
&& !(TREE_CODE (decl) == VAR_DECL && DECL_REGISTER (decl))
&& die->die_tag != DW_TAG_member)
{
/* Counters that are collected. */
Index: gcc/cfg-flags.def
===================================================================
--- gcc/cfg-flags.def (revision 210180)
+++ gcc/cfg-flags.def (working copy)
@@ -93,6 +93,9 @@ DEF_BASIC_BLOCK_FLAG(VISITED, 13)
demand, and is available after calling compute_transaction_bits(). */
DEF_BASIC_BLOCK_FLAG(IN_TRANSACTION, 14)
+/* Set on blocks that has been annotated during AutoFDO profile
+ attribution. */
+DEF_BASIC_BLOCK_FLAG(ANNOTATED, 15)
ANNOTATED is uninformative name, perhaps AUTOFDO_ANNOTATED?
What is the annotation used for?
Post by Dehao Chen
Index: gcc/ira-int.h
===================================================================
--- gcc/ira-int.h (revision 210180)
+++ gcc/ira-int.h (working copy)
@@ -41,9 +41,10 @@ along with GCC; see the file COPYING3. If not see
analogous to REG_FREQ_FROM_BB. When optimizing for size, or
profile driven feedback is available and the function is never
executed, frequency is always equivalent. Otherwise rescale the
- edge frequency. */
+ edge frequency. For AutoFDO, even if a function is not sampled,
+ it can still be executed, thus frequency rescale is still used. */
#define REG_FREQ_FROM_EDGE_FREQ(freq) \
- (optimize_size || (flag_branch_probabilities \
+ (optimize_size || (flag_branch_probabilities && !flag_auto_profile \
&& !ENTRY_BLOCK_PTR_FOR_FN (cfun)->count) \
? REG_FREQ_MAX : (freq * REG_FREQ_MAX / BB_FREQ_MAX) \
? (freq * REG_FREQ_MAX / BB_FREQ_MAX) : 1)
This also needs more consistent solution.
I think this macro is out of date and hsould test
optimize_function_for_size_p predicate and the predicate should do the right
thing for auto-FDO.

This change is pre-approved for mainline.
Post by Dehao Chen
Index: gcc/tree-profile.c
===================================================================
--- gcc/tree-profile.c (revision 210180)
+++ gcc/tree-profile.c (working copy)
@@ -696,7 +696,7 @@ bool
pass_ipa_tree_profile::gate (function *)
{
/* When profile instrumentation, use or test coverage shall be performed. */
- return (!in_lto_p
+ return (!in_lto_p && !flag_auto_profile
&& (flag_branch_probabilities || flag_test_coverage
|| profile_arc_flag));
Update comment here.
Post by Dehao Chen
}
Index: gcc/tree-inline.c
===================================================================
--- gcc/tree-inline.c (revision 210180)
+++ gcc/tree-inline.c (working copy)
@@ -1977,9 +1977,10 @@ copy_edges_for_bb (basic_block bb, gcov_type count
edge new_edge;
flags = old_edge->flags;
+ flags &= (~EDGE_ANNOTATED);
Probably comment on why you want to clear annotations.
Post by Dehao Chen
@@ -2191,6 +2192,9 @@ initialize_cfun (tree new_fndecl, tree callee_fnde
else
count_scale = REG_BR_PROB_BASE;
+ if (count_scale > REG_BR_PROB_BASE)
+ count_scale = REG_BR_PROB_BASE;
+
/* Register specific tree functions. */
gimple_register_cfg_hooks ();
@@ -2452,6 +2456,9 @@ copy_cfg_body (copy_body_data * id, gcov_type coun
else
count_scale = REG_BR_PROB_BASE;
+ if (count_scale > REG_BR_PROB_BASE)
+ count_scale = REG_BR_PROB_BASE;
+
/* Register specific tree functions. */
gimple_register_cfg_hooks ();
These changes also looks independent and should go in separately (with an explanation)
Post by Dehao Chen
Index: gcc/bb-reorder.c
===================================================================
--- gcc/bb-reorder.c (revision 210180)
+++ gcc/bb-reorder.c (working copy)
@@ -2663,7 +2663,7 @@ pass_partition_blocks::gate (function *fun)
user defined section attributes. Don't call it if either case
arises. */
return (flag_reorder_blocks_and_partition
- && optimize
+ && optimize &&!flag_auto_profile
Formatting error. Why we want !flag_auto_profile? Can't we just arrange
flag_reorder_blocks_and_partition to default to false?

In fact I would like to see flag_reorder_blocks_and_partition to offload obviously
cold code (in partiuclar the EH cleanups) even without profile. That should work
well with autofdo.
Post by Dehao Chen
Index: gcc/coverage.c
===================================================================
--- gcc/coverage.c (revision 210180)
+++ gcc/coverage.c (working copy)
@@ -54,6 +54,7 @@ along with GCC; see the file COPYING3. If not see
#include "intl.h"
#include "filenames.h"
#include "target.h"
+#include "auto-profile.h"
#include "gcov-io.h"
#include "gcov-io.c"
@@ -1175,7 +1176,9 @@ coverage_init (const char *filename)
bbg_file_stamp = local_tick;
- if (flag_branch_probabilities)
+ if (flag_auto_profile)
+ init_auto_profile ();
Perhaps initialization can happen from toplev.c instead?
Post by Dehao Chen
Index: gcc/profile.c
===================================================================
--- gcc/profile.c (revision 210180)
+++ gcc/profile.c (working copy)
@@ -106,6 +106,12 @@ static int total_num_times_called;
static int total_hist_br_prob[20];
static int total_num_branches;
+void add_working_set (gcov_working_set_t *set) {
+ int i = 0;
+ for (; i < NUM_GCOV_WORKING_SETS; i++)
+ gcov_working_sets[i] = set[i];
+}
Comment here.
Post by Dehao Chen
Index: gcc/regs.h
===================================================================
--- gcc/regs.h (revision 210180)
+++ gcc/regs.h (working copy)
@@ -137,6 +137,7 @@ extern size_t reg_info_p_size;
frequency. */
#define REG_FREQ_FROM_BB(bb) (optimize_size \
|| (flag_branch_probabilities \
+ && !flag_auto_profile \
&& !ENTRY_BLOCK_PTR_FOR_FN (cfun)->count) \
? REG_FREQ_MAX \
: ((bb)->frequency * REG_FREQ_MAX / BB_FREQ_MAX)\
Again, I think optimize_function_for_size is good predicate here.
Post by Dehao Chen
Index: gcc/common.opt
===================================================================
--- gcc/common.opt (revision 210180)
+++ gcc/common.opt (working copy)
@@ -878,6 +878,16 @@ fauto-inc-dec
Common Report Var(flag_auto_inc_dec) Init(1)
Generate auto-inc/dec instructions
+fauto-profile
+Common Report Var(flag_auto_profile) Optimization
+Use sample profile information for call graph node weights. The default
+profile file is fbdata.afdo in 'pwd'.
+
+fauto-profile=
+Common Joined RejectNegative Var(auto_profile_file)
+Use sample profile information for call graph node weights. The profile
+file is specified in the argument.
I believe auto-fdo will need a separate section in the manual describing the
whole machinery.
Post by Dehao Chen
Index: gcc/opts.c
===================================================================
--- gcc/opts.c (revision 210180)
+++ gcc/opts.c (working copy)
@@ -627,6 +627,11 @@ default_options_optimization (struct gcc_options *
default_param_value (PARAM_MIN_CROSSJUMP_INSNS),
opts->x_param_values, opts_set->x_param_values);
+ if (flag_auto_profile)
+ maybe_set_param_value
+ (PARAM_EARLY_INLINER_MAX_ITERATIONS, 10,
+ opts->x_param_values, opts_set->x_param_values);
+
/* Allow default optimizations to be specified on a per-machine basis. */
maybe_default_options (opts, opts_set,
targetm_common.option_optimization_table,
@@ -1720,6 +1725,56 @@ common_handle_option (struct gcc_options *opts,
opts->x_flag_devirtualize_speculatively = false;
break;
+ opts->x_auto_profile_file = xstrdup (arg);
+ opts->x_flag_auto_profile = true;
+ value = true;
+ /* No break here - do -fauto-profile processing. */
+ if (!opts_set->x_flag_branch_probabilities)
+ opts->x_flag_branch_probabilities = value;
+ if (!opts_set->x_flag_profile_values)
+ opts->x_flag_profile_values = value;
+ if (!opts_set->x_flag_unroll_loops)
+ opts->x_flag_unroll_loops = value;
+ if (!opts_set->x_flag_peel_loops)
+ opts->x_flag_peel_loops = value;
+ if (!opts_set->x_flag_tracer)
+ opts->x_flag_tracer = value;
+ if (!opts_set->x_flag_value_profile_transformations)
+ opts->x_flag_value_profile_transformations = value;
+ if (!opts_set->x_flag_inline_functions)
+ opts->x_flag_inline_functions = value;
+ if (!opts_set->x_flag_ipa_cp)
+ opts->x_flag_ipa_cp = value;
+ if (!opts_set->x_flag_ipa_cp_clone
+ && value && opts->x_flag_ipa_cp)
+ opts->x_flag_ipa_cp_clone = value;
+ if (!opts_set->x_flag_predictive_commoning)
+ opts->x_flag_predictive_commoning = value;
+ if (!opts_set->x_flag_unswitch_loops)
+ opts->x_flag_unswitch_loops = value;
+ if (!opts_set->x_flag_gcse_after_reload)
+ opts->x_flag_gcse_after_reload = value;
+ if (!opts_set->x_flag_tree_loop_vectorize
+ && !opts_set->x_flag_tree_vectorize)
+ opts->x_flag_tree_loop_vectorize = value;
+ if (!opts_set->x_flag_tree_slp_vectorize
+ && !opts_set->x_flag_tree_vectorize)
+ opts->x_flag_tree_slp_vectorize = value;
+ if (!opts_set->x_flag_vect_cost_model)
+ opts->x_flag_vect_cost_model = VECT_COST_MODEL_DYNAMIC;
+ if (!opts_set->x_flag_tree_loop_distribute_patterns)
+ opts->x_flag_tree_loop_distribute_patterns = value;
+ /* Indirect call profiling should do all useful transformations
+ speculative devirutalization does. */
+ if (!opts_set->x_flag_devirtualize_speculatively
+ && opts->x_flag_value_profile_transformations)
+ opts->x_flag_devirtualize_speculatively = false;
+ if (!opts_set->x_flag_profile_correction)
+ opts->x_flag_profile_correction = value;
+ break;
Instead of cut&paste from profile-use I would go for function setting the common flags.
Post by Dehao Chen
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c (revision 210180)
+++ gcc/tree-cfg.c (working copy)
@@ -1877,6 +1877,14 @@ gimple_merge_blocks (basic_block a, basic_block b)
}
}
+ /* When merging two BBs, if their counts are different, the larger
+ count is selected as the new bb count. */
+ if (flag_auto_profile && a->count != b->count)
+ {
+ a->count = MAX (a->count, b->count);
+ a->frequency = MAX (a->frequency, b->frequency);
+ }
Separate change and OK for mainline. (perhaps with comment update
that this if to handle better inconsistent profiles)
Post by Dehao Chen
+
/* Merge the sequences. */
last = gsi_last_bb (a);
gsi_insert_seq_after (&last, bb_seq (b), GSI_NEW_STMT);
Index: gcc/cgraphclones.c
===================================================================
--- gcc/cgraphclones.c (revision 210180)
+++ gcc/cgraphclones.c (working copy)
@@ -435,6 +435,11 @@ cgraph_clone_node (struct cgraph_node *n, tree dec
}
else
count_scale = 0;
+ /* In AutoFDO, if edge count is larger than callee's entry block
+ count, we will not update the original callee because it may
+ mistakenly mark some hot function as cold. */
+ if (flag_auto_profile && count >= n->count)
+ update_original = false;
This looks like a hack. Lets go with it later - I see what you are shooting for here
and non-autoFDO has kind of same problem, too. Counts are not at all that reliable
after some inlining.

I wonder how this hack would fare for -fprofile-use
Post by Dehao Chen
@@ -1286,6 +1285,9 @@ del_node_map (void)
struct cgraph_node*
find_func_by_profile_id (int profile_id)
{
+ if (flag_auto_profile)
+ return cgraph_node_for_asm (
+ get_identifier ((const char *)(size_t)profile_id));
I think with LTO we will need to make this machinery safe for symbol mangling. Any plans on this?
Post by Dehao Chen
@@ -1492,8 +1494,8 @@ gimple_ic_transform (gimple_stmt_iterator *gsi)
/* The order of CHECK_COUNTER calls is important -
since check_counter can correct the third parameter
and we want to make count <= all <= bb_all. */
- if ( check_counter (stmt, "ic", &all, &bb_all, bb_all)
- || check_counter (stmt, "ic", &count, &all, all))
+ if (!flag_auto_profile && (check_counter (stmt, "ic", &all, &bb_all, bb_all)
+ || check_counter (stmt, "ic", &count, &all, all)))
Why you skip check for auto profile here, but not for other transforms?
Similar check is now done at ipa-profile later. I suppose you may want to make autofdo
code to make the counters just look right when they are supposed to be right.
Post by Dehao Chen
Index: gcc/auto-profile.c
===================================================================
--- gcc/auto-profile.c (revision 0)
+++ gcc/auto-profile.c (revision 0)
@@ -0,0 +1,1584 @@
+/* Calculate branch probabilities, and basic block execution counts.
+ Copyright (C) 2012. Free Software Foundation, Inc.
I will look into this file incrementally. You want to update copyright year.
Post by Dehao Chen
Index: gcc/ipa-inline.c
===================================================================
--- gcc/ipa-inline.c (revision 210180)
+++ gcc/ipa-inline.c (working copy)
@@ -121,6 +121,7 @@ along with GCC; see the file COPYING3. If not see
#include "ipa-inline.h"
#include "ipa-utils.h"
#include "sreal.h"
+#include "auto-profile.h"
#include "cilk.h"
/* Statistics we collect about inlining algorithm. */
@@ -450,6 +451,8 @@ want_early_inline_function_p (struct cgraph_edge *
if (DECL_DISREGARD_INLINE_LIMITS (callee->decl))
;
+ else if (flag_auto_profile && afdo_callsite_hot_enough_for_early_inline (e))
+ ;
You want to explain what happens here (I remember it from presentation :)
Post by Dehao Chen
Index: gcc/ipa-inline.h
===================================================================
--- gcc/ipa-inline.h (revision 210180)
+++ gcc/ipa-inline.h (working copy)
@@ -345,3 +345,5 @@ reset_edge_growth_cache (struct cgraph_edge *edge)
edge_growth_cache[edge->uid] = zero;
}
}
+
+unsigned int early_inliner (function *fun);
Can't this be handled by scheduling early inliner as subpass of autofdo or something similar?
I would rather make this explicit in PM.

Honza
Dehao Chen
2014-10-09 00:24:21 UTC
Permalink
Hi, Honza,

Sorry for the delay. I just picked up the original patch, and updated
it with your comments.

I've addressed most of your comments. Something else to discuss inlined.

I had refactored the patch to make it much less intrusive. New patch
is attached (ChangeLog will be added in the final version of the
patch). The performance of the new patch is the same as the original
patch.

Thanks,
Dehao
Post by Jan Hubicka
Post by Dehao Chen
Index: gcc/bb-reorder.c
===================================================================
--- gcc/bb-reorder.c (revision 210180)
+++ gcc/bb-reorder.c (working copy)
@@ -2663,7 +2663,7 @@ pass_partition_blocks::gate (function *fun)
user defined section attributes. Don't call it if either case
arises. */
return (flag_reorder_blocks_and_partition
- && optimize
+ && optimize &&!flag_auto_profile
Formatting error. Why we want !flag_auto_profile? Can't we just arrange
flag_reorder_blocks_and_partition to default to false?
In fact I would like to see flag_reorder_blocks_and_partition to offload obviously
cold code (in partiuclar the EH cleanups) even without profile. That should work
well with autofdo.
This will cause bzip2 performance to degrade 6%. I haven't had time to
triage the problem. Will investigate this later.
Post by Jan Hubicka
Post by Dehao Chen
Index: gcc/coverage.c
===================================================================
--- gcc/coverage.c (revision 210180)
+++ gcc/coverage.c (working copy)
@@ -54,6 +54,7 @@ along with GCC; see the file COPYING3. If not see
#include "intl.h"
#include "filenames.h"
#include "target.h"
+#include "auto-profile.h"
#include "gcov-io.h"
#include "gcov-io.c"
@@ -1175,7 +1176,9 @@ coverage_init (const char *filename)
bbg_file_stamp = local_tick;
- if (flag_branch_probabilities)
+ if (flag_auto_profile)
+ init_auto_profile ();
Perhaps initialization can happen from toplev.c instead?
This is actually reading of afdo profile. I've renamed the function to
make it consistent.
Post by Jan Hubicka
Post by Dehao Chen
Index: gcc/cgraphclones.c
===================================================================
--- gcc/cgraphclones.c (revision 210180)
+++ gcc/cgraphclones.c (working copy)
@@ -435,6 +435,11 @@ cgraph_clone_node (struct cgraph_node *n, tree dec
}
else
count_scale = 0;
+ /* In AutoFDO, if edge count is larger than callee's entry block
+ count, we will not update the original callee because it may
+ mistakenly mark some hot function as cold. */
+ if (flag_auto_profile && count >= n->count)
+ update_original = false;
This looks like a hack. Lets go with it later - I see what you are shooting for here
and non-autoFDO has kind of same problem, too. Counts are not at all that reliable
after some inlining.
I wonder how this hack would fare for -fprofile-use
It does not affect -fprofile-use performance in my testing. However,
for AutoFDO, this happens much more often.
Post by Jan Hubicka
Post by Dehao Chen
@@ -1286,6 +1285,9 @@ del_node_map (void)
struct cgraph_node*
find_func_by_profile_id (int profile_id)
{
+ if (flag_auto_profile)
+ return cgraph_node_for_asm (
+ get_identifier ((const char *)(size_t)profile_id));
I think with LTO we will need to make this machinery safe for symbol mangling. Any plans on this?
No plan yet. Let's get AutoFDO in first, and spend some separate
effort to make AutoFDO+LTO work?
Post by Jan Hubicka
Post by Dehao Chen
Index: gcc/ipa-inline.h
===================================================================
--- gcc/ipa-inline.h (revision 210180)
+++ gcc/ipa-inline.h (working copy)
@@ -345,3 +345,5 @@ reset_edge_growth_cache (struct cgraph_edge *edge)
edge_growth_cache[edge->uid] = zero;
}
}
+
+unsigned int early_inliner (function *fun);
Can't this be handled by scheduling early inliner as subpass of autofdo or something similar?
I would rather make this explicit in PM.
See my comment below in auto-profile.c
Post by Jan Hubicka
Post by Dehao Chen
Index: gcc/auto-profile.c
===================================================================
--- gcc/auto-profile.c (revision 0)
+++ gcc/auto-profile.c (revision 0)
+ After the above 3 phases, all profile is readily annotated on the GCC IR.
+ AutoFDO tries to reuse all FDO infrastructure as much as possible to make
+ use of the profile. E.g. it uses existing mechanism to calculate the basic
+ block/edge frequency, as well as the cgraph node/edge count.
+*/
I suppose the phases can be made explicit to the PM, so early inliner can stay
a pass.
Unfortunately, this is more complicated:

while(changed) {
vpt();
einline();
annotate();
}
Post by Jan Hubicka
Post by Dehao Chen
+/* Store a string array, indexed by string position in the array. */
+class string_table {
+ static string_table *create ();
+
+ /* For a given string, returns its index. */
+ int get_index (const char *name) const;
+
+ /* For a given decl, returns the index of the decl name. */
+ int get_index_by_decl (tree decl) const;
+
+ /* For a given index, returns the string. */
+ const char *get_name (int index) const;
I wonder how these should work with LTO. I suppose we can tell user to not LTO
for train run, we should teach the table to ignore LTO generated suffixes as
well as random seeds (like coverage code already does).
What happens when two static functions have same name?
Shall we first focus on non-LTO version. And there could be separate
effort to make AutoFDO+LTO work?

For static functions, there will only be one copy in the profile. the
profile will be merged. I agree that this could potentially cause
problem. But OTOH, in a good practice, these functions should *not* be
hot functions. At least in our practice, we didn't see any hot
functions like this.
Post by Jan Hubicka
Post by Dehao Chen
+/* Profile for all functions. */
+class autofdo_source_profile {
+ static autofdo_source_profile *create ()
+ {
+ autofdo_source_profile *map = new autofdo_source_profile ();
I think we still want empty lines here. Otherwise I trust you on following GCC C++ coding standards :)
Post by Dehao Chen
+ if (map->read ())
+ return map;
+ delete map;
+ return NULL;
+ }
+/* Helper functions. */
+
+/* Return the original name of NAME: strip the suffix that starts
+ with '.' */
+
+static const char *get_original_name (const char *name)
+{
+ char *ret = xstrdup (name);
+ char *find = strchr (ret, '.');
+ if (find != NULL)
+ *find = 0;
+ return ret;
+}
Stripping all suffixes will likely make conflicts on static functions, right?
Example like this will get you a conflict
m()
{
void t(void)
{
}
t();
}
m2()
{
void t2(void)
{
}
t2();
}
that is of course not that important given that nested functions are not top priority
but it would be nice to handle it gratefuly.
I agree. My original intention is to make the function name handling
as simple as possible. The current impl is based on the asumption that
all important (hot) functions deserve a separate name (not just suffix
difference). We could definitely refine the function name handling in
the future patches.
Post by Jan Hubicka
I would definitely merge logic here with logic in gcov that already knows how to skip
random seeds and other cases that makes filenames to diverge...
Could you point me to the gcov logic that handles skipping function name suffix?
Post by Jan Hubicka
Post by Dehao Chen
+
+/* Return the combined location, which is a 32bit integer in which
+ higher 16 bits stores the line offset of LOC to the start lineno
+ of DECL, The lower 16 bits stores the discrimnator. */
+
+static unsigned
+get_combined_location (location_t loc, tree decl)
+{
+ return ((LOCATION_LINE (loc) - DECL_SOURCE_LINE (decl)) << 16);
+}
What happens on overflows here? I guess they are possible - 65536 lines is not infinity
these days.
If a function is longer than 65535 lines, then some lines may have
incorrect count attributed. But in pactice, we did not find any
function that is over 10K lines.
Post by Jan Hubicka
The discriminator is added later?
Yes, we need to update this when discrminator support is ported upstream.
Post by Jan Hubicka
Please review the file for the coding standards.
I'll do a final pass of the coding standard after all code logics are OK.
Post by Jan Hubicka
Post by Dehao Chen
+
+static void
+fake_read_autofdo_module_profile ()
Probably could get in only after LIPO?
Yes, this is only there to make sure that the profile format is
compatible with google branch. But after LIPO is upstream, we can add
the actual content here.
Post by Jan Hubicka
Post by Dehao Chen
+{
+ /* Read in the module info. */
+ gcov_read_unsigned ();
+
+ /* Skip the length of the section. */
+ gcov_read_unsigned ();
You probably want to check that length is 0 or so.
We already checked num_module is 0, whch is equivalent with checking
length is 0.
Post by Jan Hubicka
Post by Dehao Chen
+/* If a basic block's count is known, and only one of its in/out edges' count
+ is unknown, its count can be calculated.
+ Meanwhile, if all of the in/out edges' counts are known, then the basic
+ block's unknown count can also be calculated.
+ IS_SUCC is true if out edges of a basic blocks are examined.
+ Return TRUE if any basic block/edge count is changed. */
Interesting profile cleanup.
I wonder wy EDGE_ANNOTATED is exported now (i.e. why it is not maintained only withing
auto-FDO). Do you have more use for it outside auto-FDO?
I was thinking that having a flag about profile being reliable may be useful idea, but never
really tried to implement it.
Good point, removed the ANNOTATED flags.
Post by Jan Hubicka
Post by Dehao Chen
+
+/* Special propagation for circuit expressions. Because GCC translates
+ control flow into data flow for circuit expressions. E.g.
+ if (a && b)
+ BB2
+ else
+ BB3
+
+
+ if (a)
+ goto BB.t1
+ else
+ goto BB.t3
+ if (b)
+ goto BB.t2
+ else
+ goto BB.t3
+ goto BB.t3
+ tmp = PHI (0 (BB1), 0 (BB.t1), 1 (BB.t2)
+ if (tmp)
+ goto BB2
+ else
+ goto BB3
+
+ In this case, we need to propagate through PHI to determine the edge
+ count of BB1->BB.t1, BB.t1->BB.t2. */
So here you have known probabilities of BB1, BB2 and BB3 and you need to guess all the rest?
Why it need so much of gimple matching?
This is basically because some basic blocks have no stmt so that we
can not get its count from source level profile. However, we can get
edge count through PHI nodes.
Post by Jan Hubicka
Post by Dehao Chen
+ while (changed && i++ < 10)
+ {
+ changed = false;
+
+ if (afdo_propagate_edge (true))
+ changed = true;
+ if (afdo_propagate_edge (false))
+ changed = true;
+ afdo_propagate_circuit ();
+ }
I wonder if there is any code to share with same (except for the circuit thing) code in profile.c
and gcov.c? Profile.c also have the minimal flow algorithm for fixing broken profiles, perhaps
it could be useful here?
The initial SampleFDO implementation uses MCF algorithm to calculate
edge counts (the current mcf.c actually comes from that effort).
However, from our performance tuning effort, we found that MCF is an
overkill for AutoFDO purpose, and it's extremely slow and hard to
debug. Thus we implemented this ad-hoc heuristic for propagation.
Post by Jan Hubicka
Post by Dehao Chen
+ FOR_ALL_BB_FN (bb, cfun)
+ {
+ edge e;
+ edge_iterator ei;
+
+ FOR_EACH_EDGE (e, ei, bb->succs)
+ e->count =
+ (double) bb->count * e->probability / REG_BR_PROB_BASE;
+ bb->aux = NULL;
+ }
+
+ loop_optimizer_finalize ();
+ free_dominance_info (CDI_DOMINATORS);
+ free_dominance_info (CDI_POST_DOMINATORS);
+}
I wonder if using the estimated profile to fill in the missing gaps would make sense?
I.e. running the standard branch probabilities->freuqencies propagation code bug this
time taking the determined counts as known values?
Not sure if I follow. Could you elaborate?
Post by Jan Hubicka
Post by Dehao Chen
+
+/* Perform value profile transformation using AutoFDO profile. Add the
+ promoted stmts to PROMOTED_STMTS. Return TRUE if there is any
+ indirect call promoted. */
+
+static bool
+afdo_vpt_for_early_inline (stmt_set *promoted_stmts)
What exactly this function does? It inlines according to the inline decisions
made at train run time and recorded to profile feedback?
As replied above, we need iterative vpt-einline to make sure the IR
looks exactly the same as profiling binary before annotation.

Jan Hubicka
2014-05-16 00:46:50 UTC
Permalink
Post by Dehao Chen
Index: gcc/auto-profile.c
===================================================================
--- gcc/auto-profile.c (revision 0)
+++ gcc/auto-profile.c (revision 0)
@@ -0,0 +1,1584 @@
+/* Calculate branch probabilities, and basic block execution counts.
Update the toplevel comment, too.
Post by Dehao Chen
+
+/* The following routines implements AutoFDO optimization.
+
+ This optimization uses sampling profiles to annotate basic block counts
+ and uses heuristics to estimate branch probabilities.
+
+
+ Phase 1: Read profile from the profile data file.
+ * string_table: a map between function name and its index.
+ * autofdo_source_profile: a map from function_instance name to
+ function_instance. This is represented as a forest of
+ function_instances.
+ * WorkingSet: a histogram of how many instructions are covered for a
+ given percentage of total cycles.
+
+ Phase 2: Early inline.
+ * inlined in the profiled binary.
+ * callee body is hot in the profiling run.
+ If both condition satisfies, early inline will inline the callsite
+ regardless of the code growth.
+
+ Phase 3: Annotate control flow graph.
+ * Annotate basic block count
+ * Estimate branch probability
+
+ After the above 3 phases, all profile is readily annotated on the GCC IR.
+ AutoFDO tries to reuse all FDO infrastructure as much as possible to make
+ use of the profile. E.g. it uses existing mechanism to calculate the basic
+ block/edge frequency, as well as the cgraph node/edge count.
+*/
I suppose the phases can be made explicit to the PM, so early inliner can stay
a pass.
Post by Dehao Chen
+/* Store a string array, indexed by string position in the array. */
+class string_table {
+ static string_table *create ();
+
+ /* For a given string, returns its index. */
+ int get_index (const char *name) const;
+
+ /* For a given decl, returns the index of the decl name. */
+ int get_index_by_decl (tree decl) const;
+
+ /* For a given index, returns the string. */
+ const char *get_name (int index) const;
I wonder how these should work with LTO. I suppose we can tell user to not LTO
for train run, we should teach the table to ignore LTO generated suffixes as
well as random seeds (like coverage code already does).

What happens when two static functions have same name?
Post by Dehao Chen
+/* Profile for all functions. */
+class autofdo_source_profile {
+ static autofdo_source_profile *create ()
+ {
+ autofdo_source_profile *map = new autofdo_source_profile ();
I think we still want empty lines here. Otherwise I trust you on following GCC C++ coding standards :)
Post by Dehao Chen
+ if (map->read ())
+ return map;
+ delete map;
+ return NULL;
+ }
+/* Helper functions. */
+
+/* Return the original name of NAME: strip the suffix that starts
+ with '.' */
+
+static const char *get_original_name (const char *name)
+{
+ char *ret = xstrdup (name);
+ char *find = strchr (ret, '.');
+ if (find != NULL)
+ *find = 0;
+ return ret;
+}
Stripping all suffixes will likely make conflicts on static functions, right?
Example like this will get you a conflict
m()
{
void t(void)
{
}
t();
}
m2()
{
void t2(void)
{
}
t2();
}

that is of course not that important given that nested functions are not top priority
but it would be nice to handle it gratefuly.
I would definitely merge logic here with logic in gcov that already knows how to skip
random seeds and other cases that makes filenames to diverge...
Post by Dehao Chen
+
+/* Return the combined location, which is a 32bit integer in which
+ higher 16 bits stores the line offset of LOC to the start lineno
+ of DECL, The lower 16 bits stores the discrimnator. */
+
+static unsigned
+get_combined_location (location_t loc, tree decl)
+{
+ return ((LOCATION_LINE (loc) - DECL_SOURCE_LINE (decl)) << 16);
+}
What happens on overflows here? I guess they are possible - 65536 lines is not infinity
these days.
The discriminator is added later?
Post by Dehao Chen
+
+int
+string_table::get_index (const char *name) const
+{
+ if (name == NULL)
+ return -1;
+ string_index_map::const_iterator iter = map_.find (name);
+ if (iter == map_.end())
+ return -1;
+ else
+ return iter->second;
+}
+
+int
+string_table::get_index_by_decl (tree decl) const
Comments missing here.
Post by Dehao Chen
+
+/* Read the profile and create a function_instance with head count as
+ HEAD_COUNT. Recursively read callsites to create nested function_instances
+ too. STACK is used to track the recursive creation process. */
+
+function_instance *
+function_instance::read_function_instance (
+ function_instance_stack *stack, gcov_type head_count)
+{
+ unsigned name = gcov_read_unsigned ();
+ unsigned num_pos_counts = gcov_read_unsigned ();
+ unsigned num_callsites = gcov_read_unsigned ();
+ function_instance *s = new function_instance (name, head_count);
+ stack->push_back(s);
+
+ for (unsigned i = 0; i < num_pos_counts; i++)
+ {
+ unsigned offset = gcov_read_unsigned () & 0xffff0000;
+ unsigned num_targets = gcov_read_unsigned ();
+ gcov_type count = gcov_read_counter ();
+ s->pos_counts[offset].count = count;
+ for (unsigned j = 0; j < stack->size(); j++)
+ (*stack)[j]->total_count_ += count;
+ for (unsigned j = 0; j < num_targets; j++)
+ {
+ /* Only indirect call target histogram is supported now. */
+ gcov_read_unsigned ();
+ gcov_type target_idx = gcov_read_counter ();
+ s->pos_counts[offset].targets[target_idx] =
+ gcov_read_counter ();
+ }
+ }
+ for (unsigned i = 0; i < num_callsites; i++) {
+ unsigned offset = gcov_read_unsigned ();
+ function_instance *callee_function_instance =
+ read_function_instance (stack, 0);
+ s->callsites[std::make_pair (offset, callee_function_instance->name ())] =
+ callee_function_instance;
+ }
+ stack->pop_back();
+ return s;
Can you please add a summary of the file format and what is read in here?
Post by Dehao Chen
+/* Find count_info for a given gimple STMT. If found, store the count_info
+ in INFO and return true; otherwise return false. */
+
+bool
+autofdo_source_profile::get_count_info (gimple stmt, count_info *info) const
+{
+ if (LOCATION_LOCUS (gimple_location (stmt)) == cfun->function_end_locus)
+ return false;
+
+ inline_stack stack;
+ get_inline_stack (gimple_location (stmt), &stack);
+ if (stack.size () == 0)
+ return false;
+ function_instance *s = get_function_instance_by_inline_stack (stack);
+ if (s == NULL)
+ return false;
+ return s->get_count_info (stack[0].second, info);
+}
+
+/* Mark LOC as annotated. */
+
+void
+autofdo_source_profile::mark_annotated (location_t loc) {
Please review the file for the coding standards.
Post by Dehao Chen
+
+ /* Program behavior changed, original promoted (and inlined) target is not
+ hot any more. Will avoid promote the original target.
+
+ To check if original promoted target is still hot, we check the total
+ count of the unpromoted targets (stored in old_info). If it is no less
+ than half of the callsite count (stored in INFO), the original promoted
+ target is considered not hot any more. */
+ if (total >= info->count * 0.5)
Probably /2, I am not sure if we can hit roundoff errors with long long->double->long long,
but it is better to avoid it anyway.
Post by Dehao Chen
+/* Module profile is only used by LIPO. Here we simply ignore it. */
+
+static void
+fake_read_autofdo_module_profile ()
Probably could get in only after LIPO?
Post by Dehao Chen
+{
+ /* Read in the module info. */
+ gcov_read_unsigned ();
+
+ /* Skip the length of the section. */
+ gcov_read_unsigned ();
You probably want to check that length is 0 or so.
Post by Dehao Chen
+ for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+ {
+ count_info info;
+ gimple stmt = gsi_stmt (gsi);
+ if (stmt->code == GIMPLE_DEBUG)
+ continue;
Probably also clobbers can be skipped.
Post by Dehao Chen
+/* If a basic block's count is known, and only one of its in/out edges' count
+ is unknown, its count can be calculated.
+ Meanwhile, if all of the in/out edges' counts are known, then the basic
+ block's unknown count can also be calculated.
+ IS_SUCC is true if out edges of a basic blocks are examined.
+ Return TRUE if any basic block/edge count is changed. */
Interesting profile cleanup.
I wonder wy EDGE_ANNOTATED is exported now (i.e. why it is not maintained only withing
auto-FDO). Do you have more use for it outside auto-FDO?

I was thinking that having a flag about profile being reliable may be useful idea, but never
really tried to implement it.
Post by Dehao Chen
+
+/* Special propagation for circuit expressions. Because GCC translates
+ control flow into data flow for circuit expressions. E.g.
+ if (a && b)
+ BB2
+ else
+ BB3
+
+
+ if (a)
+ goto BB.t1
+ else
+ goto BB.t3
+ if (b)
+ goto BB.t2
+ else
+ goto BB.t3
+ goto BB.t3
+ tmp = PHI (0 (BB1), 0 (BB.t1), 1 (BB.t2)
+ if (tmp)
+ goto BB2
+ else
+ goto BB3
+
+ In this case, we need to propagate through PHI to determine the edge
+ count of BB1->BB.t1, BB.t1->BB.t2. */
So here you have known probabilities of BB1, BB2 and BB3 and you need to guess all the rest?
Why it need so much of gimple matching?
Post by Dehao Chen
+
+static void
+afdo_propagate_circuit (void)
+{
+ basic_block bb;
+ FOR_ALL_BB_FN (bb, cfun)
+ {
+ gimple phi_stmt;
+ tree cmp_rhs, cmp_lhs;
+ gimple cmp_stmt = last_stmt (bb);
+ edge e;
+ edge_iterator ei;
+
+ if (!cmp_stmt || gimple_code (cmp_stmt) != GIMPLE_COND)
+ continue;
+ cmp_rhs = gimple_cond_rhs (cmp_stmt);
+ cmp_lhs = gimple_cond_lhs (cmp_stmt);
+ if (!TREE_CONSTANT (cmp_rhs)
+ || !(integer_zerop (cmp_rhs) || integer_onep (cmp_rhs)))
+ continue;
+ if (TREE_CODE (cmp_lhs) != SSA_NAME)
+ continue;
+ if ((bb->flags & BB_ANNOTATED) == 0)
+ continue;
+ phi_stmt = SSA_NAME_DEF_STMT (cmp_lhs);
+ while (phi_stmt && gimple_code (phi_stmt) == GIMPLE_ASSIGN
+ && gimple_assign_single_p (phi_stmt)
+ && TREE_CODE (gimple_assign_rhs1 (phi_stmt)) == SSA_NAME)
+ phi_stmt = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (phi_stmt));
+ if (!phi_stmt || gimple_code (phi_stmt) != GIMPLE_PHI)
+ continue;
+ FOR_EACH_EDGE (e, ei, bb->succs)
+ {
+ unsigned i, total = 0;
+ edge only_one;
+ bool check_value_one = (((integer_onep (cmp_rhs))
+ ^ (gimple_cond_code (cmp_stmt) == EQ_EXPR))
+ ^ ((e->flags & EDGE_TRUE_VALUE) != 0));
+ if ((e->flags & EDGE_ANNOTATED) == 0)
+ continue;
+ for (i = 0; i < gimple_phi_num_args (phi_stmt); i++)
+ {
+ tree val = gimple_phi_arg_def (phi_stmt, i);
+ edge ep = gimple_phi_arg_edge (phi_stmt, i);
+
+ if (!TREE_CONSTANT (val) || !(integer_zerop (val)
+ || integer_onep (val)))
+ continue;
+ if (check_value_one ^ integer_onep (val))
+ continue;
+ total++;
+ only_one = ep;
+ if (e->probability == 0 && (ep->flags & EDGE_ANNOTATED) == 0)
+ {
+ ep->probability = 0;
+ ep->count = 0;
+ ep->flags |= EDGE_ANNOTATED;
+ }
+ }
+ if (total == 1 && (only_one->flags & EDGE_ANNOTATED) == 0)
+ {
+ only_one->probability = e->probability;
+ only_one->count = e->count;
+ only_one->flags |= EDGE_ANNOTATED;
+ }
+ }
+ }
+}
+
+/* Propagate the basic block count and edge count on the control flow
+ graph. We do the propagation iteratively until stablize. */
+
+static void
+afdo_propagate (void)
+{
+ basic_block bb;
+ bool changed = true;
+ int i = 0;
+
+ FOR_ALL_BB_FN (bb, cfun)
+ {
+ bb->count = ((basic_block) bb->aux)->count;
+ if ((((basic_block) bb->aux)->flags & BB_ANNOTATED) != 0)
+ bb->flags |= BB_ANNOTATED;
+ }
+
+ while (changed && i++ < 10)
+ {
+ changed = false;
+
+ if (afdo_propagate_edge (true))
+ changed = true;
+ if (afdo_propagate_edge (false))
+ changed = true;
+ afdo_propagate_circuit ();
+ }
I wonder if there is any code to share with same (except for the circuit thing) code in profile.c
and gcov.c? Profile.c also have the minimal flow algorithm for fixing broken profiles, perhaps
it could be useful here?
Post by Dehao Chen
+ afdo_find_equiv_class ();
+ afdo_propagate ();
+
+ FOR_EACH_BB_FN (bb, cfun)
+ {
+ edge e;
+ edge_iterator ei;
+ int num_unknown_succ = 0;
+ gcov_type total_count = 0;
+
+ FOR_EACH_EDGE (e, ei, bb->succs)
+ {
+ if ((e->flags & EDGE_ANNOTATED) == 0)
+ num_unknown_succ ++;
+ else
+ total_count += e->count;
+ }
+ if (num_unknown_succ == 0 && total_count > 0)
+ {
+ FOR_EACH_EDGE (e, ei, bb->succs)
+ e->probability =
+ (double) e->count * REG_BR_PROB_BASE / total_count;
+ }
+ }
+ FOR_ALL_BB_FN (bb, cfun)
+ {
+ edge e;
+ edge_iterator ei;
+
+ FOR_EACH_EDGE (e, ei, bb->succs)
+ e->count =
+ (double) bb->count * e->probability / REG_BR_PROB_BASE;
+ bb->aux = NULL;
+ }
+
+ loop_optimizer_finalize ();
+ free_dominance_info (CDI_DOMINATORS);
+ free_dominance_info (CDI_POST_DOMINATORS);
+}
I wonder if using the estimated profile to fill in the missing gaps would make sense?
I.e. running the standard branch probabilities->freuqencies propagation code bug this
time taking the determined counts as known values?
Post by Dehao Chen
+
+/* Perform value profile transformation using AutoFDO profile. Add the
+ promoted stmts to PROMOTED_STMTS. Return TRUE if there is any
+ indirect call promoted. */
+
+static bool
+afdo_vpt_for_early_inline (stmt_set *promoted_stmts)
What exactly this function does? It inlines according to the inline decisions
made at train run time and recorded to profile feedback?

The patch looks very resonable in general. Lets break it up to smaller chunks and
get it merged. It is an interesting stuff!

Thanks a lot!
Honza
Loading...