Added INTERVAL_FROM_CHI2

cschreib · cschreib · commit e9421a08dfcd · 2019-01-12T13:09:30.000Z
diff --git a/README.md b/README.md
@@ -295,6 +295,9 @@ To get the closest behavior to that of FAST-IDL, you should set ```C_INTERVAL=68
  * ```SAVE_SIM```: possible values are ```0``` or ```1```. The default is ```0```. If set to ```1```, the program will save the best fitting parameters in the Monte Carlo simulations of each galaxy in a FITS table located at ```best_fits/[catalog]_[source].sim.fits```. This table contains the values of all fitting parameters and the chi2. This will consume some more disk space, but will not slow down the program significantly. It can be useful to identify covariances and degeneracies that are not apparent from the confidence intervals printed in the output catalog.
  * ```BEST_FROM_SIM```: possible values are ```0``` or ```1```. The default is ```0```, and the best fitting solution will be chosen as the one providing the smallest chi2 value in the grid. If set to ```1```, the program will instead determine the best solution from the median of all the Monte Carlo simulations. This will ensure that the "best fit" values are more consistent with the confidence intervals (i.e., usually more centered), and erases large fluctuations when multiple solutions with very different fit parameters lead to very close chi2 values. This typically happens for galaxies with poor photometry: there are a large number of models which give similarly good chi2, but one of them has a chi2 better by a very small amount (say 0.001) and is thus picked as the "best fit".
 
+## Confidence intervals from chi2 grid
+ * ```INTERVAL_FROM_CHI2```: possible values are ```0``` or ```1```. The default is ```0```, in which case confidence intervals (error bars) are computed using a series of Monte Carlo simulations of the data. An alternative, much faster way to obtain similar results is to read off the range of parameter values covered by models in the grid, selecting only the models that have a chi2 within a certain threshold of the best value (see Avni 1976). This can be done by setting this value to ```1```. In most cases the resulting confidence intervals will be almost the same as with the MC simulations, but this approach has two advantages. First, by construction, the resulting confidence intervals always include the best fit value; this is not guaranteed with MC simulations. Second, this method behaves better when the best fit solution is close to the edge of the grid; intervals from MC simulations are computed from percentiles and may therefore under-represent edge values. This is particularly important for parameters that have a natural upper or lower bound, such as parameters that cannot be negative but for which zero is a perfectly fine value (e.g., Av, or some SFH parameters). There are also two drawbacks. First, to obtain precise confidence intervals, the grid must be relatively fine: if the true confidence interval of a parameter is 1 +/- 0.01, but the fitting grid for that parameter only has a step of 0.1, then the code will report 1 +/- 0, which is too optimistic. Second, for this option to work the program has to write part of the chi2 grid to disk (see ```SAVE_BESTCHI```), and the amount of space required can be large if you have a very fine grid.
+
 ## Controlling the cache
  * ```NO_CACHE```: possible values are ```0``` or ```1```. The default is ```0```, and the program will read and/or create a cache file, storing the pre-computed model fluxes for reuse. If you are changing your grid often or if the grid is very large and you do not want to store it on the disk, you can set this value to ```1``` and the program will neither read from nor write to the cache. Because it avoids some IO operations, it may make the program faster when the grid has to be rebuilt.
 
@@ -320,7 +323,7 @@ To get the closest behavior to that of FAST-IDL, you should set ```C_INTERVAL=68
  * ```BEST_SFHS```: possible values are ```0``` or ```1```. The default is ```0```. If set to ```1```, the program will output the best fit star formation history (SFH) to a file, in the ```best_fits``` directory (as for the best fit SEDs). If Monte Carlo simulations are enabled, the program will also output confidence intervals on the SFH for each time step, as well as the median SFH among all Monte Carlo simulations. This median may not correspond to any analytical form allowed by your chosen SFH model.
  * ```SFH_OUTPUT_STEP```: possible values are any strictly positive number, which defines the size of a time step in the output SFH (in Myr). The default is ```10``` Myr.
  * ```SFH_OUTPUT```: possible values are ```'sfr'``` or ```'mass'```. The default is ```'sfr'```, and the program outputs as "SFH" the evolution of the instantaneous SFR of each galaxy with time. If set to ```'mass'```, the program will output instead the evolution of the stellar mass with time (which is usually better behaved, see Glazebrook et al. 2017). Note that the evolution of the mass accounts for mass loss, so the mass slowly _decreases_ with time after a galaxy has quenched.
- * ```SAVE_BESTCHI```: FAST++ can save the entire chi2 grid on the disk with the ```SAVE_CHI_GRID``` option. However, if you have *huge* grids, this can require too much disk space (I have been in situations where the chi2 grid would be as large as several TB!). Usually, one is not interested in the chi2 of *all* models, but only those that match the data within some tolerance threshold. This option allows you to only save on the disk the models that are worst than the best chi2 by some amount ```chi2 - best_chi2 < SAVE_BESTCHI``` (where ```SAVE_BESTCHI=1``` if you are interested in standard 68% confidence intervals, or ```2.71``` for 90% confidence, etc., see Avni 1976). These "good" models are saved in a separate ".grid" file for each galaxy of the input catalog, inside the ```best_chi2``` folder. The format is similar to the ".grid" file for the full chi2 grid (which is described above), but not identical. The ``fast++-grid2fits`` tool can also convert these files into FITS tables. The binary format is the following:
+ * ```SAVE_BESTCHI```: FAST++ can save the entire chi2 grid on the disk with the ```SAVE_CHI_GRID``` option. However, if you have *huge* grids, this can require too much disk space (I have been in situations where the chi2 grid would be as large as several TB!). Usually, one is not interested in the chi2 of *all* models, but only those that match the data within some tolerance threshold. This option allows you to save on the disk only the models whose chi2 exceeds the best chi2 by less than a chosen amount, ```chi2 - best_chi2 < SAVE_BESTCHI``` (where ```SAVE_BESTCHI=1``` if you are interested in standard 68% confidence intervals, or ```2.71``` for 90% confidence, etc., see Avni 1976). If you set the ```INTERVAL_FROM_CHI2``` option, the program will use these saved grids to automatically compute parameter confidence intervals. The parameters of all these "good" models are saved in a separate ".grid" file for each galaxy of the input catalog, inside the ```best_chi2``` folder. The format is similar to the ".grid" file for the full chi2 grid (which is described above), but not identical. The ``fast++-grid2fits`` tool can also convert these files into FITS tables. The binary format is the following:
 ```
 # Begin header
 # ------------
diff --git a/example/fast.param b/example/fast.param
@@ -208,6 +208,11 @@ APPLY_VDISP    = 0          # km/s
 #   best-fit instead of the model with the smallest chi squared on
 #   the original (unperturbed) photometry.
 #
+# o INTERVAL_FROM_CHI2: use the chi2 grid directly to compute confidence
+#   intervals on the fit parameters, instead of using Monte Carlo
+#   simulations. This requires 'SAVE_BESTCHI' to be set to a value large
+#   enough to encompass the chosen confidence intervals; the program
+#   stops with an error otherwise.
+#
 # o SAVE_SIM: save the best-fit parameters for each Monte Carlo
 #   simulation for all sources in the "best_fits" directory.
 #
@@ -272,6 +277,7 @@ N_SIM              = 100
 C_INTERVAL         = 68            # 68 / 95 / 99 or [68,95] etc
 BEST_FIT           = 0             # 0 / 1
 BEST_FROM_SIM      = 0             # 0 / 1
+INTERVAL_FROM_CHI2 = 0             # 0 / 1
 SAVE_SIM           = 0             # 0 / 1
 SFR_AVG            = 0             # 0, 100 Myr, 300 Myr etc
 INTRINSIC_BEST_FIT = 0             # 0 / 1
diff --git a/src/fast++-fitter.cpp b/src/fast++-fitter.cpp
@@ -315,12 +315,6 @@ void fitter_t::write_chi2(uint_t igrid, const vec1f& chi2, const vec2f& props, u
             if (chi2.safe[cis] < best_chi2.safe[is]) {
                 best_chi2.safe[is] = chi2.safe[cis];
 
-                struct datum {
-                    uint32_t id;
-                    float chi2;
-                    vec1f p;
-                };
-
                 // Read the saved data and write simultaneously
                 file::move(chi2_filename.safe[is], chi2_filename.safe[is]+".old");
                 in.open(chi2_filename.safe[is]+".old", std::ios::binary | std::ios::in);
@@ -723,6 +717,47 @@ vec2d make_grid_bins(const vec1d& grid) {
     return bins;
 }
 
+double get_chi2_from_conf_interval(double conf) {
+    if (1.0 - conf < 1e-6) return 24.0;
+
+    double eps = 1e-3;
+
+    double chi2 = 0.0;
+    double prev_chi2;
+    double delta = 1.0;
+    bool last_increase = true;
+
+    // This is a naive iterative inversion of the error function.
+    // It yields the corresponding chi2 value with a relative accuracy of 'eps'.
+    // Not the fastest implementation, but we don't care much about speed here.
+
+    do {
+        // Compute confidence interval for this chi2
+        double p = erf(sqrt(chi2/2.0));
+
+        // Move chi2
+        prev_chi2 = chi2;
+        if (p < conf) {
+            if (!last_increase) {
+                delta *= 0.5;
+                last_increase = true;
+            }
+
+            chi2 += delta;
+        } else {
+            if (last_increase) {
+                delta *= 0.5;
+                last_increase = false;
+            }
+
+            chi2 -= delta;
+        }
+
+    } while (abs(chi2/prev_chi2 - 1.0) > eps);
+
+    return chi2;
+}
+
 void fitter_t::find_best_fits() {
     if (opts.parallel == parallel_choice::models) {
         if (opts.verbose) note("waiting for all models to finish...");
@@ -743,6 +778,17 @@ void fitter_t::find_best_fits() {
         }
     }
 
+    vec1f delta_chi2;
+    if (opts.interval_from_chi2) {
+        for (auto& c : input.conf_interval) {
+            if (c < 0.5) {
+                delta_chi2.push_back(-get_chi2_from_conf_interval(1.0 - 2*c));
+            } else {
+                delta_chi2.push_back(+get_chi2_from_conf_interval(2*c - 1.0));
+            }
+        }
+    }
+
     if (opts.verbose) note("finding best fits...");
     for (uint_t is : range(input.id)) {
         if (!silence_invalid_chi2 && !is_finite(output.best_chi2[is])) {
@@ -759,6 +805,8 @@ void fitter_t::find_best_fits() {
             }
         }
 
+        // Deal with Monte Carlo Simulations
+
         if (opts.n_sim > 0) {
             vec1u bmodel = output.mc_best_model(is,_);
 
@@ -788,57 +836,100 @@ void fitter_t::find_best_fits() {
                 }
             }
 
-            // For grid parameters, use cumulative distribution
-            for (uint_t ip : range(gridder.nparam)) {
-                vec1d grid = sorted_grid[ip];
+            if (!opts.interval_from_chi2) {
+                // For grid parameters, use cumulative distribution
+                for (uint_t ip : range(gridder.nparam)) {
+                    vec1d grid = sorted_grid[ip];
 
-                if (grid.size() == 1) {
-                    if (opts.best_from_sim) {
-                        output.best_params(is,ip,0) = grid[0];
-                    }
+                    if (grid.size() == 1) {
+                        if (opts.best_from_sim) {
+                            output.best_params(is,ip,0) = grid[0];
+                        }
 
-                    for (uint_t ic : range(input.conf_interval)) {
-                        output.best_params(is,ip,1+ic) = grid[0];
+                        for (uint_t ic : range(input.conf_interval)) {
+                            output.best_params(is,ip,1+ic) = grid[0];
+                        }
+                    } else {
+                        // Build cumulative histogram of binned values
+                        vec2d bins = make_grid_bins(grid);
+                        vec1d hist = histogram(bparams.safe(ip,_), bins);
+                        vec1d cnt = cumul(hist);
+                        cnt /= cnt.back();
+
+                        // Treat the edges in a special way to avoid extrapolation beyond the grid
+                        prepend(cnt, {0.0});
+                        prepend(grid, {grid.front()});
+                        append(cnt, {1.0});
+                        append(grid, {grid.back()});
+
+                        // Compute percentiles by interpolating the cumulative PDF
+                        auto get_percentile = [&](double p) {
+                            return interpolate(grid, cnt, p);
+                        };
+
+                        if (opts.best_from_sim) {
+                            output.best_params(is,ip,0) = get_percentile(0.5);
+                        }
+
+                        for (uint_t ic : range(input.conf_interval)) {
+                            output.best_params(is,ip,1+ic) =
+                                get_percentile(input.conf_interval[ic]);
+                        }
                     }
-                } else {
-                    // Build cumulative histogram of binned values
-                    uint_t ng = grid.size();
-                    vec2d bins = make_grid_bins(grid);
-                    vec1d hist = histogram(bparams.safe(ip,_), bins);
-                    vec1d cnt = cumul(hist);
-                    cnt /= cnt.back();
-
-                    // Treat the edges in a special way to avoid extrapolation beyond the grid
-                    prepend(cnt, {0.0});
-                    prepend(grid, {grid.front()});
-                    append(cnt, {1.0});
-                    append(grid, {grid.back()});
-
-                    // Compute percentiles by interpolating the cumulative PDF
-                    auto get_percentile = [&](double p) {
-                        return interpolate(grid, cnt, p);
-                    };
+                }
+
+                // For properties, use percentiles
+                for (uint_t ip : range(gridder.nparam, bparams.dims[0])) {
+                    vec1d bp = bparams.safe(ip,_);
 
                     if (opts.best_from_sim) {
-                        output.best_params(is,ip,0) = get_percentile(0.5);
+                        output.best_params(is,ip,0) = inplace_median(bp);
                     }
 
                     for (uint_t ic : range(input.conf_interval)) {
-                        output.best_params(is,ip,1+ic) = get_percentile(input.conf_interval[ic]);
+                        output.best_params(is,ip,1+ic) =
+                            inplace_percentile(bp, input.conf_interval[ic]);
                     }
                 }
             }
+        }
 
-            // For properties, use percentiles
-            for (uint_t ip : range(gridder.nparam, bparams.dims[0])) {
-                vec1d bp = bparams.safe(ip,_);
+        // If asked, obtain confidence intervals from chi2 grid saved on disk
 
-                if (opts.best_from_sim) {
-                    output.best_params(is,ip,0) = inplace_median(bp);
-                }
+        if (opts.interval_from_chi2) {
+            std::ifstream in(chi2_filename[is], std::ios::binary);
+            in.seekg(obchi2.hpos);
 
-                for (uint_t ic : range(input.conf_interval)) {
-                    output.best_params(is,ip,1+ic) = inplace_percentile(bp, input.conf_interval[ic]);
+            while (in) {
+                uint32_t id;
+                float chi2;
+                vec1f p(gridder.nprop);
+
+                if (file::read(in, id) && file::read(in, chi2) && file::read(in, p)) {
+                    vec1u ids = gridder.grid_ids(id);
+                    for (uint_t ip : range(gridder.nparam+gridder.nprop)) {
+                        double v;
+                        if (ip < gridder.nparam) {
+                            v = output.grid[ip][ids[ip]];
+                        } else {
+                            v = p[ip - gridder.nparam];
+                        }
+
+                        for (uint_t ic : range(input.conf_interval)) {
+                            if (chi2 - best_chi2.safe[is] < abs(delta_chi2[ic])) {
+                                float& saved = output.best_params(is,ip,1+ic);
+                                if (delta_chi2[ic] < 0.0) {
+                                    if (saved > v || !is_finite(saved)) {
+                                        saved = v;
+                                    }
+                                } else {
+                                    if (saved < v || !is_finite(saved)) {
+                                        saved = v;
+                                    }
+                                }
+                            }
+                        }
+                    }
                 }
             }
         }
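
The loop above widens each bound whenever a saved model falls within the chi2 threshold. As a minimal standalone sketch of the same idea for a single parameter (the `model_t` struct and `interval_from_chi2` function are hypothetical names for illustration, and the in-memory layout is an assumption, not the FAST++ file format):

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <utility>
#include <vector>

// One model of the grid, reduced to a single parameter value and its chi2.
// Hypothetical in-memory layout for illustration; not the FAST++ file format.
struct model_t {
    double value;
    double chi2;
};

// Range of 'value' covered by all models with chi2 - best_chi2 < delta_chi2.
// The best model always satisfies this condition (for delta_chi2 > 0), so the
// returned interval contains the best fit value by construction.
std::pair<double, double> interval_from_chi2(const std::vector<model_t>& models,
                                             double delta_chi2) {
    assert(!models.empty() && delta_chi2 > 0.0);

    // First pass: find the smallest chi2 in the set of models.
    double best_chi2 = std::numeric_limits<double>::infinity();
    for (const model_t& m : models) {
        best_chi2 = std::min(best_chi2, m.chi2);
    }

    // Second pass: keep the extreme values among the "good" models.
    double low = std::numeric_limits<double>::infinity();
    double high = -std::numeric_limits<double>::infinity();
    for (const model_t& m : models) {
        if (m.chi2 - best_chi2 < delta_chi2) {
            low = std::min(low, m.value);
            high = std::max(high, m.value);
        }
    }

    return {low, high};
}
```

With a threshold of 1 and models at values 0.8, 0.9, 1.0, 1.1, 1.2 with chi2 3.0, 0.7, 0.0, 0.4, 2.5, this returns the range 0.9 to 1.1. With a coarser grid where only the model at 1.0 passes the threshold, it returns 1.0 to 1.0, which is the "1 +/- 0" case the README warns about.
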
diff --git a/src/fast++-read_input.cpp b/src/fast++-read_input.cpp
@@ -164,6 +164,7 @@ bool read_params(options_t& opts, input_state_t& state, const std::string& filen
         PARSE_OPTION(rest_mag)
         PARSE_OPTION(continuum_indices)
         PARSE_OPTION(sfh_quantities)
+        PARSE_OPTION(interval_from_chi2)
 
         #undef  PARSE_OPTION
         #undef  PARSE_OPTION_RENAME
@@ -321,6 +322,21 @@ bool read_params(options_t& opts, input_state_t& state, const std::string& filen
         return false;
     }
 
+    if (opts.interval_from_chi2) {
+        double max_interval = max(opts.c_interval);
+        double max_chi2 = get_chi2_from_conf_interval(max_interval/100.0);
+        if (max_chi2 > opts.save_bestchi) {
+            error("with 'INTERVAL_FROM_CHI2=1', ", max_interval, "% confidence interval "
+                "requires 'SAVE_BESTCHI>=", max_chi2, "'");
+            return false;
+        }
+    }
+
+    if (!opts.best_from_sim && opts.interval_from_chi2 && !opts.save_sim && opts.n_sim != 0) {
+        warning("with the current setup, Monte Carlo simulations are not used; setting 'N_SIM=0'");
+        opts.n_sim = 0;
+    }
+
     if (!opts.my_sfh.empty()) {
         opts.sfh = sfh_type::single;
     } else if (!opts.custom_sfh.empty()) {
@@ -355,7 +371,7 @@ bool read_params(options_t& opts, input_state_t& state, const std::string& filen
         }
     }
 
-    if (opts.n_sim != 0) {
+    if (opts.n_sim != 0 || opts.interval_from_chi2) {
         state.conf_interval = 0.5*(1.0 - opts.c_interval/100.0);
         inplace_sort(opts.c_interval);
         vec1f cint = state.conf_interval;
diff --git a/src/fast++.hpp b/src/fast++.hpp
@@ -121,6 +121,7 @@ struct options_t {
     float save_bestchi = 0.0;
     vec1u rest_mag;
     std::string continuum_indices;
+    bool interval_from_chi2 = false;
 
     // Custom SFH
     std::string custom_sfh;
@@ -538,4 +539,7 @@ namespace file {
 }
 }
 
+// defined in fast++-fitter.cpp
+double get_chi2_from_conf_interval(double conf);
+
 #endif
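
For reference, `get_chi2_from_conf_interval` inverts `erf(sqrt(chi2/2))`, the confidence level enclosed by a given chi2 offset for one parameter of interest. The sketch below is not the code from this commit: it solves the same equation with plain bisection, as one possible alternative to the step-halving loop, and the name `delta_chi2_for_confidence` and the upper bracket of 50 are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: chi2 offset enclosing the confidence level 'conf'
// (0 <= conf < 1) for one parameter of interest, i.e. the solution x of
// erf(sqrt(x/2)) = conf.
double delta_chi2_for_confidence(double conf) {
    assert(conf >= 0.0 && conf < 1.0);

    double low = 0.0;
    double high = 50.0; // erf(sqrt(50/2)) differs from 1 by about 1e-12

    // erf(sqrt(x/2)) increases monotonically with x, so bisection converges.
    for (int i = 0; i < 60; ++i) {
        double mid = 0.5*(low + high);
        if (std::erf(std::sqrt(mid/2.0)) < conf) {
            low = mid;
        } else {
            high = mid;
        }
    }

    return 0.5*(low + high);
}
```

Calling it with 0.6827 gives a value very close to 1, and with 0.90 a value close to 2.71, which matches the thresholds quoted for `SAVE_BESTCHI` in the README.
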
