Multi-model evaluation of phenology prediction for wheat in Australia

Predicting wheat phenology is important for cultivar selection and effective crop management, and provides a baseline for evaluating the effects of global change. Evaluating how well crop phenology can be predicted is therefore of major interest. Twenty-eight wheat modeling groups participated in this evaluation. Our target population was wheat fields in the major wheat growing regions of Australia under current climatic conditions and with current local management practices. The environments used for calibration and for evaluation were both sampled from this same target population. The calibration and evaluation environments had neither sites nor years in common, so this is a rigorous evaluation of the ability of modeling groups to predict phenology for new sites and weather conditions. Mean absolute error (MAE) for the evaluation environments, averaged over predictions of three phenological stages and over modeling groups, was 9 days, with a range across groups from 6 to 20 days. Predictions using the multi-modeling-group mean and median had prediction errors nearly as small as those of the best modeling group. About two thirds of the modeling groups performed better than a simple but relevant benchmark, which predicts phenology by assuming a constant temperature sum for each development stage. The added complexity of crop models beyond just the effect of temperature was thus justified in most cases. There was substantial variability between modeling groups using the same model structure, which implies that model improvement could be achieved not only by improving model structure, but also by improving parameter values, in particular through improved calibration techniques.


1. Introduction
Crop phenology describes the cycle of biological events during plant growth. These events include, for example, seedling emergence, leaf appearance, flowering, and maturity. The timing of growing seasons and their critical phases, as well as estimates of that timing, are increasingly important in a changing climate (Olesen et al., 2012; Dalhaus et al., 2018). Matching the phenology of crop varieties to the climate in which they grow is critical for viable crop production strategies (Rezaei et al., 2018). Furthermore, accurate simulation of phenology is essential for models which simulate plant growth and yield (Archontoulis et al., 2014; Boote et al., 2008, 2010).

In this study we focus on wheat phenology in Australia. Australia was the world's ninth largest producer of wheat in 2018 and the sixth largest exporter (Workman, 2020). Crop model predictions of phenology have been used in various studies related to wheat production in Australia. In a study by Luo et al. (2018), the APSIM model was used to simulate the changes in phenology, water use efficiency, and yield to be expected from global climate change. The APSIM model was also used to evaluate changes in wheat phenology in Australia resulting from warming temperatures in recent decades (Sadras and Monzon, 2006), and to determine the flowering date at each location associated with the highest average yield (Flohr et al., 2017).

Given the interest in using crop models to predict phenology, it is important to evaluate those predictions. How well can wheat phenology be predicted? In trying to answer this question, one must first define exactly what aspect of the models is being evaluated, and then must choose an appropriate methodology for carrying out the evaluation.

Figure 1. Location of calibration (red circles) and evaluation (blue triangles) sites across the Australian cropping zones (shaded area; source: Teluguntla et al., 2018).
Plots were visited regularly (about every two weeks), starting soon after emergence of the early sowing and ending after crop maturity, and the Zadoks growth stage (Zadoks et al., 1974), on a scale from 1 to 100, was determined. Overall, there were 709 combinations of environment and measurement date, with an average of 10.7 stage notations per environment. The stages to be predicted here are stage Z30 (Zadoks stage 30, pseudostem, i.e. youngest leaf sheath erection), stage Z65 (Zadoks stage 65, anthesis half-way, i.e. anthers occurring half way to tip and base of ear), and stage Z90 (Zadoks stage 90, grain hard, difficult to divide). These stages are often used for management decisions or to characterize phenology.

In preparing the data for the simulation study, a linear interpolation was performed between each pair of stages, to give the date for every integer Zadoks stage from the first to the last observed stage. If the observed Zadoks stage for one date was larger than the observed Zadoks stage for the next measurement date, both stages were replaced by the average of the two before interpolation. The interpolated values were provided in order to avoid different modeling groups using different methods for interpolating the data, which would have added uncertainty unrelated to model performance.

The average standard deviation of observed Zadoks stages based on the replicates was 0.93 days. The standard deviation of interpolated days after sowing to Z30, Z65, and Z90 was calculated using a bootstrap. For a date with r replicates, a sample of size r was obtained by drawing values at random with replacement, independently for each measurement date. The Zadoks values were then interpolated as for the original data. This was done 1000 times, giving standard deviations of 1.8 days for observed days to Z30, 0.9 days for observed days to Z65, and 0.5 days for observed days to Z90.
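As an illustration of the interpolation and bootstrap procedure just described, the following is a minimal sketch in R (the language used elsewhere in this study for the environment characterization). Function and variable names are illustrative, not those of the actual analysis code.

```r
## Interpolate days after sowing to target Zadoks stages, averaging any
## non-monotonic pair of successive observed stages before interpolating.
zadoks_days <- function(day, stage, target) {
  for (i in seq_len(length(stage) - 1)) {
    if (stage[i] > stage[i + 1]) stage[i:(i + 1)] <- mean(stage[i:(i + 1)])
  }
  approx(x = stage, y = day, xout = target, ties = mean)$y
}

## Bootstrap standard deviation of interpolated days to one target stage.
## 'reps' is a list with one vector of replicate Zadoks scores per date.
boot_sd_days <- function(day, reps, target, B = 1000) {
  draws <- replicate(B, {
    s <- sapply(reps, function(r) mean(sample(r, length(r), replace = TRUE)))
    zadoks_days(day, s, target)
  })
  sd(draws, na.rm = TRUE)
}
```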

Part of the data was provided to the modeling groups for calibration, and part was never revealed to the participants and was used for evaluation. The calibration data originated from four sites, two years, and three planting dates, for a total of 24 environments. The evaluation data were from six sites, one year, and three planting dates, for a total of 18 environments (Table 1). The data were divided in such a way that the calibration and evaluation data had neither sites nor years in common.

Table 1. Sites and sowing dates for calibration (underlined) and evaluation (bold).

To characterize the environments, we calculated for each environment the average temperature from sowing to Z30, Z65, and Z90, the average photoperiod from Z30 to Z65 using the daylength function in the R package insol (Corripio, 2019; R Core Team, 2017), and the days to full vernalization using the model of van Bussel et al. (2015) with the value Vsat = 25 days, estimated from the figure in their paper. Figure 2 shows the range of average temperature, day length, and days to vernalization for the calibration and evaluation environments, as well as the range of observed calendar days to Z30, Z65, and Z90. The range of values for the evaluation data is always within the range of the calibration data, with the single exception of photoperiod.
While the median and maximum day lengths were very similar for the two sets of environments, the shortest day length was 11.5 hours among the calibration environments, while among the evaluation environments the shortest day length was 10.1 hours.

Twenty-eight different modeling groups participated in this study, where "modeling group" refers to the group of people conducting the modeling exercise. Each modeling group is associated with a specific model structure (a specific named model) and also with specific parameter values. The model structures involved are presented in Supplementary Table S1. Models were considered to have the same structure even if the version number was different, because version differences are expected to be negligible for phenology. Three of the model structures were used by more than one group. Since different groups using the same structure obtained different results, identifying the contributions by the name of the model would be misleading. Furthermore, the performance of specific groups was not of major interest here. The modeling groups were therefore anonymized and identified only by a number. There is no model M5 because that group dropped out in the course of the study. The model structures used by more than one group are noted S1 (three groups), S2 (three groups), and S3 (two groups).

Details about the way phenology is modeled by each model structure can be found in the references for each model (Supplementary Table S1). Here we give only a brief overview. The principal factors that affect winter wheat developmental rate are temperature, day length, and degree of vernalization (Johnen et al., 2012). Most, but not all, model structures take all three factors into account. The simplest approach to modeling the effect of temperature is to assume that the development rate increases linearly with daily average temperature above some base temperature (a parameter). In other models the rate may be constant above some optimal temperature (a parameter), the development rate may decline above the optimum temperature at some rate (a parameter), or the development rate may be some more complex function of temperature (Kumudini et al., 2014; Wang et al., 2017). The parameters of the temperature response curve may differ depending on development stage. The effect of photoperiod on development rate is often modeled as a multiplier that is a piecewise linear function of photoperiod. The function increases with some slope (a parameter) up to a threshold photoperiod (a parameter), and then is 1 for photoperiods longer than the threshold. Vernalization, which must be accomplished before the plant can flower, requires a period of cold temperatures. Vernalization parameters can include the upper limit for temperature to count as vernalizing and the required number of vernalizing days. Some models also relate development to the rate of leaf appearance (governed by the phyllochron, a parameter) or to the rate of tillering. Finally, several models also take into account the effect of cold or drought stress on development rate. If drought stress is taken into account, then the development rate is related to all the processes that determine soil moisture and plant water uptake.
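To make this generic structure concrete, the following is a minimal sketch in R of a daily development rate computed as the product of a temperature response, a photoperiod multiplier, and a vernalization factor. All parameter names and default values (Tbase, Topt, Pcrit, psen, Vsat) are purely illustrative and do not correspond to any particular model structure in the ensemble.

```r
## Daily development rate as temperature response x photoperiod multiplier
## x vernalization factor (illustrative parameter values).
dev_rate <- function(Tmean, daylen, vdays,
                     Tbase = 0, Topt = 26,    # temperature response (deg C)
                     Pcrit = 20, psen = 0.05, # threshold photoperiod (h), slope
                     Vsat = 25) {             # required vernalizing days
  f_T <- pmax(0, pmin(Tmean - Tbase, Topt - Tbase))    # linear up to optimum
  f_P <- pmax(0, pmin(1, 1 - psen * (Pcrit - daylen))) # piecewise linear,
                                                       # = 1 above threshold
  f_V <- pmin(1, vdays / Vsat)             # fraction of vernalization done
  f_T * f_P * f_V
}

dev_rate(Tmean = 12, daylen = 11, vdays = 10)  # example call
```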
The multi-model ensemble here was an "ensemble of opportunity", meaning that any modeling group that asked to join was accepted. The activity was announced on the list server of the Agricultural Model Intercomparison and Improvement Project (AgMIP) and on the list servers of several models. In addition to the original models, we defined two ensemble models: the model e-mean has predictions equal to the mean of the simulated values, and the model e-median has predictions equal to the median of the simulated values.
Prediction error was evaluated by comparing the simulated and observed days from sowing to each stage for the evaluation environments. The criteria of error used are the mean absolute error (MAE) and the mean squared error (MSE),

$$ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2, $$

together with the root mean squared error $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$ and the normalized error $\mathrm{NRMSE} = \mathrm{RMSE}/\bar{y}$, where $y_i$ and $\hat{y}_i$ are the observed and simulated values for environment $i$, $n$ is the number of environments, and $\bar{y}$ is the average of the observed values.

We considered two skill measures. A skill measure compares the prediction error of the modeling group being evaluated with the error of a simple model used for comparison. We define two simple models, and therefore two skill measures. Both use MSE, rather than MAE, as the measure of model error, in keeping with usual practice. The first simple model, noted "naive", predicts that days to each stage will be equal to the average number of days to that stage in the calibration data. The predictions of the naive model here are 77.1, 123.1, and 166.5 days from sowing to stages Z30, Z65, and Z90, respectively. The first skill measure, modeling efficiency (EF), is defined as

$$ \mathrm{EF} = 1 - \frac{\mathrm{MSE}}{\mathrm{MSE}_{\mathrm{naive}}}, $$

where $\mathrm{MSE}_{\mathrm{naive}}$ is the MSE of the naive model. The naive model ignores all variability and predicts that days to any stage will be the same regardless of the environment. A model with EF ≤ 0 does no better than the naive model, and so would be considered a very poor predictor. A perfect model, with no error, has a modeling efficiency of 1. Often modeling efficiency is based on the fit of a calibrated model to the data used for calibration (McCuen et al., 2006). Here, in contrast, the naive model is based on the calibration data and is used to predict for independent data.

The naive model is a very low baseline for evaluating a crop model. We therefore introduce a more realistic, but still simple, model which takes into account the effect of temperature on phenology. This "onlyT" model predicts that the sum of degree days (°D) from sowing to each stage will be equal to the sum of degree days from sowing to that stage in the calibration data, where the degree days on any calendar day are equal to the average temperature on that day (i.e. a base temperature of 0 °C). The predictions of the onlyT model are that Z30 will occur 893.7 °D after sowing, Z65 will occur 1476.0 °D after sowing, and Z90 will occur 2245.7 °D after sowing. The second skill measure, noted skillT, is then

$$ \mathrm{skillT} = 1 - \frac{\mathrm{MSE}}{\mathrm{MSE}_{\mathrm{onlyT}}}, $$

where $\mathrm{MSE}_{\mathrm{onlyT}}$ is the MSE of the onlyT model. As for any skill measure, a perfect model has skillT = 1, and a model that does no better than the onlyT model has skillT ≤ 0.
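The following is a minimal sketch in R of the two benchmark predictors and the two skill measures. The input names are illustrative: cal_days and cal_dd hold days and degree days to a given stage in the calibration data, and temp_list holds one vector of daily mean temperatures from sowing for each evaluation environment.

```r
mse <- function(obs, pred) mean((obs - pred)^2)
mae <- function(obs, pred) mean(abs(obs - pred))

## Naive benchmark: mean days to the stage in the calibration data
naive_pred <- function(n_env, cal_days) rep(mean(cal_days), n_env)

## onlyT benchmark: first day on which the cumulative degree-day sum
## (daily mean temperature summed from sowing) reaches the calibration mean
onlyT_pred <- function(temp_list, cal_dd) {
  sapply(temp_list, function(tmean) which(cumsum(tmean) >= mean(cal_dd))[1])
}

## Skill measures relative to each benchmark
EF     <- function(obs, pred, pred_naive) 1 - mse(obs, pred) / mse(obs, pred_naive)
skillT <- function(obs, pred, pred_onlyT) 1 - mse(obs, pred) / mse(obs, pred_onlyT)
```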
2.5 Sources of variability

A major interest of ensemble studies is that they provide information on the variability in simulation results between different modeling groups. This variability can arise from differences in model structure between modeling groups or from differences in parameter values for groups that use the same model structure. In this study, three of the model structures were used by more than one modeling group. This makes it possible to estimate separately the variance in simulated values due to structure and the variance due to modeling group nested within structure (i.e. due to differences in parameter values). We treat the simulated values as a sample from the distribution of plausible model structures and plausible parameter values. According to the law of total variance (Casella and Berger, 1990), the total variance of the simulated values can be decomposed into two parts as

$$ \operatorname{var}(\hat{y}) = \operatorname{var}_S\!\left[E(\hat{y} \mid S)\right] + E_S\!\left[\operatorname{var}(\hat{y} \mid S)\right], $$

where ŷ are the simulated values, S is model structure, E is the expectation, var is the variance, and the notation |S means that the expectation (in the first term on the right-hand side) or the variance (in the second term) is taken separately for each value of model structure.

We estimated the first term by first calculating the average simulated value for each structure (if a structure is represented by a single modeling group, this is just the value simulated by that group), and then calculating the variance of those average values. This is the between-structure variability. To estimate the second term, we first calculated the variance between simulated values for each of the three structures with multiple groups, and then calculated the average of those variances. This is the within-structure variability (i.e. the variability due to parameters).
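The estimation just described amounts to only a few lines of R. Here sim is assumed to hold one simulated value per modeling group and struct the corresponding model structure labels; both names are illustrative.

```r
## Decompose the variance of simulated values into between-structure and
## within-structure parts (law of total variance).
decompose_var <- function(sim, struct) {
  means <- tapply(sim, struct, mean)    # E(yhat | S), one mean per structure
  vars  <- tapply(sim, struct, var)     # var(yhat | S); NA for single-group
                                        # structures, which are dropped below
  c(between = var(as.numeric(means)),   # var_S[E(yhat | S)]
    within  = mean(vars, na.rm = TRUE)) # E_S[var(yhat | S)], multi-group only
}

## Illustrative call: eight groups, structures A-D
decompose_var(sim = c(80, 84, 88, 90, 95, 97, 70, 100),
              struct = c("A", "A", "A", "B", "B", "C", "C", "D"))
```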

Prediction error and skill
MAE values for the evaluation data are shown in Figure 3 and summarized in Table 2. Results for individual modeling groups are given in Supplementary Table S2. Median MAE values (and ranges) were 12 days (8-25 days) for days to Z30, 10 days (5-24 days) for days to Z65, and 7 days (1-22 days) for days to Z90. The median (and range) of MAE averaged over the three stages was 9 days (6-20 days). The ensemble predictors e-mean and e-median both had averaged MAE values of 7 days. They were both only marginally worse than the best two individual modeling groups, and e-median was marginally better than e-mean. For comparison with other studies, we also report other criteria of error in Table 2.

There was substantial variability between modeling groups for each individual prediction, including between modeling groups that share the same model structure (Supplementary Figure S1). Averaged over the evaluation environments and over all three stages Z30, Z65, and Z90, the estimated within-structure standard deviation was 4.3 days and the estimated between-structure standard deviation was 11.9 days, so the within-structure standard deviation was 36 % as large as the between-structure standard deviation.

4. Discussion

The calibration and evaluation environments were drawn from the same target population, namely wheat crops in the major wheat growing regions of Australia, with current climate and local management practices. We compared the calibration and evaluation environments for the main characteristics that are likely to affect phenology, namely temperature, day length, and accumulation of vernalizing temperatures. The temperatures and vernalizing durations of the evaluation environments were within the ranges of the calibration environments, but the evaluation data had a larger range of day lengths than the calibration data. This is the result of sampling variability, and may have led to larger prediction errors than if the calibration data had covered a range of day lengths comparable to that of the evaluation data. However, the range of days to each phenology stage for the evaluation data was always within the range for the calibration data. We conclude that this study represents a case where the calibration and evaluation data represent a similar range of conditions (with the caveat just mentioned concerning photoperiod). This type of situation is of particular importance, for example, where one wants to calibrate a crop model using current conditions and subsequently test possible sowing dates within a limited range, or to compare the phenology of multiple potential cultivars at specific sites within the calibration domain.

The evaluation here was based on data which had neither sites nor years in common with the calibration data. This was thus a rigorous estimate of how well crop modeling groups can predict wheat phenology for unseen sites and weather, when provided with calibration data sampled from the target population. The median MAE among models, averaged over phenology stages, was 9 days, which was substantially larger than the standard deviation of observed stages, which was in the range of 1-2 days. The best modeling group had an average MAE of 7 days, which was still substantially larger than the standard deviation of observed stages. MAE values were significantly larger for prediction of days to Z30 than for prediction of days to later Zadoks stages. This may be due to the large variability between groups in predicting time to emergence, which is discussed in more detail below. Time to emergence is a major part of the time to Z30, but a smaller fraction of the time to Z65 or Z90.

Chauhan et al. (2019) reported an NRMSE of 0.062 for prediction of time to flowering of wheat in Australia, for a version of APSIM taking the effect of water stress on phenology into account. In that study, the model was adjusted to some extent to the data used for evaluation, so the reported error probably underestimates the error for new environments. That reported value was in any case within the range of NRMSE values found for different modeling groups here, for both the evaluation data (NRMSE from 0.056 to 0.227) and the calibration data (NRMSE from 0.041 to 0.197). Asseng et al. (2008), using the APSIM model, found an RMSE of 4 days for wheat phenology predictions (mostly predictions of days to anthesis) for 44 different environments in Western Australia, a level of error smaller than the minimum RMSE of 9 days found here for the evaluation data, and even smaller than the minimum RMSE of 6 days found here for the calibration data. In that study, the phenology model was again adjusted to some extent to the data (S. Asseng, 2020, pers. comm.), which could explain the smaller errors.

The above comparisons suggest that prediction errors are very roughly similar between studies, but that there are differences depending on the details of the prediction problem and the way prediction error is evaluated. It is clearly useful to build up a knowledge base concerning phenology prediction error, as a baseline for comparison for future studies or even as a default value if evaluation is not done. Contributions to the knowledge base will be all the more useful to the extent that the details of the prediction problem are clearly specified (including whether it is of interpolation or extrapolation type, and including a characterization of the target population) and to the extent that the evaluation maintains a rigorous separation between the predictor and the evaluation data. The present study should therefore be a valuable contribution to such a knowledge base.

It is of interest to compare the results here with those from a study structured like the present one (calibration and evaluation environments with similar characteristics, evaluation data not used for model development or tuning), but where the evaluation concerned prediction of wheat phenology in France (Wallach et al., 2019). To a large extent, the same modeling groups participated in both studies.
Specifically, the French study included 27 different modeling groups, 26 of which participated in the present study. A comparison between the two studies for those 26 groups is an indication of the variability of prediction errors between target populations for the same modeling groups.

MAE averaged over the evaluation environments and over predicted stages ranged from 3 to 13 days (median 6 days) for the French data, compared to 6 to 20 days (median 9 days) for the Australian data. The target population (wheat fields in Australia versus wheat fields in France) thus had a substantial effect on prediction errors. A detailed analysis of the underlying reasons for the larger errors in Australia is beyond the scope of this study. However, one possible contributing cause is the simulation of time to emergence. The average simulated time to emergence over all French environments was 10 days after sowing, and the mean standard deviation between modeling groups was 4 days. The corresponding values for the Australian environments were a mean emergence time of 15 days after sowing and a mean standard deviation between modeling groups of 18 days. This very large standard deviation for the Australian environments, pointing to major differences between modeling groups, may be due to dry conditions in some environments and to uncertainty regarding initial soil conditions, leading some models to simulate very long times to emergence (up to 107 days; Supplementary Figure S1). This suggests that for Australian environments, it would be valuable to have observations of time to emergence for calibration. It seems that for many modeling groups, it would be worthwhile to revisit the predictions of time to emergence under conditions like those of the Australian environments, taking advantage of specific modeling studies of time to emergence for wheat (Lindstrom et al., 1976; Wang et al., 2009).

An important question in modeling is whether the same modeling groups perform best for all target populations, or whether different groups are best for different target populations. There is quite a bit of scatter in the graph of MAE for the Australian versus the French environments (Supplementary Fig. S2), but the rank correlation between the two (Kendall's tau) is 0.31, which is statistically significant (p = 0.013). This suggests that there are modeling groups which perform better than others over a wide range of environments. Once again, it is prudent to repeat that this applies to the case where calibration is based on environments that are sampled from the target distribution. Prediction errors for extrapolation to conditions very different from those of the calibration data might behave very differently.
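For reference, this rank correlation can be computed directly in R with the standard cor.test function; the per-group MAE vectors below are illustrative values, not the actual results.

```r
## Kendall rank correlation between per-group MAE in two studies
mae_french     <- c(3, 4, 6, 7, 9, 13)    # illustrative values
mae_australian <- c(6, 8, 10, 9, 12, 20)  # illustrative values
cor.test(mae_french, mae_australian, method = "kendall")  # tau and p-value
```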

Skill measures
While prediction error is of course of interest, skill scores may be even more useful, as they indicate how models compare to alternative methods of prediction. Note that the EF skill score used here is somewhat different from the usual definition. Here, the naive model is based solely on the calibration data, so it is in fact a feasible predictor. The more usual definition of the naive model is the mean of all the data, including the data used for evaluation. Overall, all except four modeling groups had smaller MSE (were better predictors) than the naive model.

The EF criterion is a rather low baseline for evaluating the usefulness of crop models for predicting phenology. Our second skill measure compares model MSE with the MSE of the onlyT model, which assumes a constant number of degree days from sowing to each Zadoks stage and estimates that number from the calibration data. This should be a better predictor than the naive model if photoperiod and vernalization effects are limited, and so provides a more stringent test of the usefulness of process models. We found that the onlyT model was indeed a better predictor than the naive model. Nonetheless, 19 of the modeling groups performed better than the onlyT model. It seems that in most cases here, the added complexity in crop models beyond a simple sum of degree days is warranted. More generally, we suggest that systematically calculating a skill measure like skillT would give valuable information about the usefulness of more complex models.
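As a numerical illustration of how the two skill measures order a model against the benchmarks (all MSE values below are made up):

```r
mse_model <- 64    # MSE of a hypothetical modeling group
mse_naive <- 400   # MSE of the naive benchmark
mse_onlyT <- 121   # MSE of the onlyT benchmark
1 - mse_model / mse_naive  # EF = 0.84: far better than the naive model
1 - mse_model / mse_onlyT  # skillT = 0.47: also better than onlyT
```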

As found in many studies, e-median and e-mean had prediction errors comparable to those of the best modeling groups. This confirms previous evidence and theoretical considerations showing that the use of e-mean or e-median is often a good strategy (Bassu et al., 2014; Palosuo et al., 2011; Rötter et al., 2012; Wallach et al., 2018). The e-mean model is based on a simple average over simulated values, so the results from every modeling group are weighted equally. An open question in using model ensembles is whether it would be better to give more weight to models that have smaller prediction errors for the calibration data (Christensen et al., 2010), for example using Bayesian Model Averaging (Wöhling et al., 2015). The results here show that phenology predictive performance for the calibration environments is significantly correlated with predictive performance for new environments. This was also found to be the case in a study evaluating phenology prediction by modeling groups based on phenology in French environments (Wallach et al., 2019), and suggests that in these cases it may be worthwhile to use performance-weighted model ensembles. This may be due to the fact that in these studies the calibration and evaluation environments were similar to one another. In cases where one is extrapolating to conditions quite different from those represented by the calibration environments, performance weighting may be less useful. This once again emphasizes that it is important to define for each evaluation study whether it is an evaluation of type "interpolation" or of type "extrapolation".
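A minimal sketch of one such performance-weighted ensemble, with weights proportional to the inverse of each group's calibration MSE. This is one simple weighting choice, offered for illustration only; it is not the scheme of any of the studies cited, and the input names are assumptions.

```r
## Weighted ensemble prediction: rows of sims_eval are evaluation
## environments, columns are modeling groups; mse_cal holds each group's
## MSE on the calibration data.
weighted_ensemble <- function(sims_eval, mse_cal) {
  w <- (1 / mse_cal) / sum(1 / mse_cal)  # inverse-MSE weights, summing to 1
  as.vector(sims_eval %*% w)             # weighted mean per environment
}
```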

4.5 Sources of variability

A major outcome of model ensemble studies is the variability in simulated values between modeling groups, which is an indication of the uncertainty of model-based predictions (Asseng et al., 2013). Beyond a measure of the variability, it is of interest to understand the origins of the variability. One important aspect here is how differences in the model equations between model structures affect the simulated values. This, however, is difficult to untangle, given the multiple differences between structures. It seems that specific studies, for example modifying one specific aspect of multiple models, are needed to understand the various sources of structural uncertainty (Maiorano et al., 2016). The present study does not allow us to relate specific differences in model structure to differences in simulated results. However, it does allow us to separate two contributions to variability, namely the overall variability between model structures and the variability between different parameter values for the same model structure. An important question is the relative importance of the two, in order to determine priorities for reducing overall uncertainty. Parameter uncertainty can arise from uncertainty in the default values of those parameters that are fixed, from uncertainty in the choice of calibration approach (for example, the form of the objective function or the choice of parameters to estimate), and from the values of the estimated parameters, which are uncertain because there is always a limited amount of data. The within-structure variability here is a measure of the uncertainty due to the choice of default values and calibration approach, but not of the uncertainty in the values of the calibrated parameters. The within-structure standard deviation here is 4.3 days, compared to a between-structure standard deviation (the contribution of structure) of 11.9 days. The study based on French environments found a within-structure standard deviation of 5.6 days and a between-structure standard deviation of 8.0 days (Wallach et al., 2019). Confalonieri et al. (2016) also found that the within-structure effect was in general, but not in all cases, smaller than the between-structure effect on variability.

Other studies have, on the contrary, focused on structural uncertainty versus uncertainty in the calibrated parameters, without taking into account uncertainty in all the default parameter values or uncertainty in the calibration approach chosen. Zhang et al. (2017) found that model structure explained about 80 % of the variability in simulated time to heading in rice and about 92 % of the variability in simulated time to maturity in rice, the remainder of the variability being due to parameter uncertainty. Wallach et al. (2017) found that model structure uncertainty contributed about twice as much variance as parameter uncertainty to overall simulation variance. It would be of interest to have a fuller treatment of parameter uncertainty, including both different groups using the same model structure and an estimate of the uncertainty in the parameters estimated by each group.

5. Conclusions
We evaluated how well 28 crop modeling groups simulate wheat phenology in Australia, in the case where both the calibration data and the evaluation data were sampled from fields in the major wheat growing areas of Australia under current climate and local management. It is important to distinguish between interpolation-type prediction, as here, and extrapolation-type prediction, since they do not evaluate the same properties of modeling groups. It is also important to emphasize that evaluation concerns both model structure and parameter values, and therefore the modeling group and not just the underlying model structure. MAE for the evaluation data here ranged from 6 to 20 days depending on the modeling group, with a median of 9 days. About two thirds of the modeling groups performed better than a simple but relevant benchmark, which predicts phenology assuming a constant temperature sum for each development stage. The added complexity of crop models beyond just the effect of temperature is therefore justified in most cases. As found in many other studies, the multi-modeling-group mean and median had prediction errors nearly as small as those of the best modeling group, suggesting that using these ensemble predictors is a good strategy. Prediction errors for the calibration and evaluation environments were found to be significantly correlated, which suggests that for interpolation-type studies it would be of interest to test ensemble predictors that weight individual models based on performance on the calibration data. The variability due to modeling group for a given model structure, which reflects part of parameter uncertainty, was found to be smaller than the variability due to model structure, but was not negligible. This implies that model improvement could be achieved not only by improving model structure but also by improving parameter values.

Figure S1. Predictions of days from sowing to Zadoks stages Z10 (emergence), Z30, Z65, and Z90 by each modeling group for each evaluation environment. Modeling groups that used the same model structure are identified by color (red for structure S1, green for structure S2, blue for structure S3).