Snow Level From Post‐Processing of Atmospheric Model Improves Snowfall Estimate and Snowpack Prediction in Mountains

In mountains, the precipitation phase greatly varies in space and time and affects the evolution of the snow cover. Snowpack models usually rely on precipitation‐phase partitioning methods (PPMs) that use near‐surface variables. These PPMs ignore conditions above the surface thus limiting their ability to predict the precipitation phase at the surface. In this study, the impact on snowpack simulations of atmospheric‐based PPMs, incorporating upper atmospheric information, is tested using the snowpack scheme Crocus. Crocus is run at 2.5‐km grid spacing over the mountains of southwestern Canada and northwestern United States and is driven by meteorological fields from an atmospheric model at the same resolution. Two atmospheric‐based PPMs were considered from the atmospheric model: the output from a detailed microphysics scheme and a post‐processing algorithm determining the snow level and the associated precipitation phase. Two ground‐based PPMs were also included as lower and upper benchmarks: a single air temperature threshold at 0°C and a PPM using wet‐bulb temperature. Compared to the upper benchmark, the snow‐level based PPM improved the estimation of snowfall occurrence by 5% and the simulation of snow water equivalent (SWE) by 9% during the snow melting season. In contrast, due to missing processes, the microphysics scheme decreased performances in phase estimate and SWE simulations compared to the upper benchmark. These results highlight the need for detailed evaluation of the precipitation phase from atmospheric models and the benefit for mountain snow hydrology of the post‐processed snow level. The limitations to drive snowpack models at slope scale are also discussed.

The observations of precipitation phase at the surface can consist of direct visual reports from various types of observers: trained meteorological staff reporting present weather (WMO, 2015), citizen scientists (Arienzo et al., 2021;Cifelli et al., 2005) and citizens reporting in social media (Casellas et al., 2021). These direct manual observations have a limited temporal resolution (Froidurot et al., 2014). Automatic observations of precipitation phase can be obtained from disdrometers measuring the size and velocity of hydrometeors (e.g., Pettersen et al., 2021). However, these devices are limited in their abilities to measure the properties of falling snow (Battaglia et al., 2010). Finally, ground-based vertically pointing Doppler radars such as the Micro Rain Radar can be used to retrieve the precipitation type in the atmospheric layers above the ground (Garcia-Benadi et al., 2020). Due to the limited number of observations of precipitation phase, models are used to provide spatially and temporally continuous estimates of precipitation phase (Harpold, Kaplan, et al., 2017).
Snowpack schemes generally rely on ground-based precipitation-phase partitioning methods (PPMs) that use near-surface atmospheric variables to determine the precipitation phase (Harpold, Kaplan, et al., 2017). The simplest approaches use fixed thresholds for dry-bulb temperature (commonly referred to as air temperature in the literature and in the rest of the text) to separate between rain and snow (Harpold, Kaplan, et al., 2017). Thresholds of 0°C or 1°C are often used in models (Bernier et al., 2011;Durand et al., 1993). Air temperature ranges can be also introduced to consider a mixture of rain and snow (hereafter referred as mixed precipitation) (Quick & Pipes, 1977). Temperature-only PPMs can provide reasonable performances for a given region (e.g., Kienzle, 2008) but cannot be applied over large spatial extents because of the spatial variability in rain-snow air temperature thresholds (Jennings et al., 2018). Humidity-dependent PPMs have been proposed to overcome this limitation and account for the effects of latent heat exchanges on precipitation phase. They can depend on dew point temperature , wet-bulb temperature (Ding et al., 2014;Wang et al., 2019), hydrometeor temperature derived from the psychometric equation (Harder & Pomeroy, 2013), simple humidity-dependent air temperature threshold (Feiccabrino et al., 2015; or statistical models combining air temperature and relative humidity (Froidurot et al., 2014). Humidity-dependent PPMs have been demonstrated to improve in phase prediction (Casellas et al., 2021;Froidurot et al., 2014;Harder & Pomeroy, 2013;Jennings et al., 2018) and resulting snowpack simulations (Jennings & Molotch, 2019;Wang et al., 2019) compared to temperature-only PPMs. However, all these ground-based PPMs ignore atmospheric conditions in the layers aloft which can limit their ability to predict the precipitation phase at the surface (Feiccabrino et al., 2015;Harpold, Kaplan, et al., 2017). They also neglect the impact on the snow cover of other precipitation types such as freezing rain (Quéno et al., 2018). combine near-surface variables with integrated information on the lower atmosphere such as air mass boundary categories (Feiccabrino et al., 2012) or low-level vertical air temperature lapse rate (Sims & Liu, 2015). More advanced diagnostic methods use information from the full vertical profile of air temperature (sometimes combined with humidity) obtained from upper-air observations or atmospheric models (Birk et al., 2021;Bourgouin, 2000;Reeves et al., 2014). The Bourgouin technique (Bourgouin, 2000), for example, calculates areas of positive melting energy and negative refreezing energy from air temperature in the lower atmosphere for determining the precipitation type: rain, snow, mixed precipitation, ice pellets and freezing rain. Birk et al. (2021) improved the Bourgouin technique by considering the vertical profile of wet-bulb temperature instead of air temperature. In mountainous terrain, vertical profiles of air or wet-bulb temperature can also be used to estimate the height of the snow level (Fehlmann et al., 2018;Haiden et al., 2011;Tobin et al., 2012) where the snow level is taken at the bottom of the snow-rain transition zone and is defined as the lowest level in the melting layer where snow is still present (Minder & Kingsmill, 2013). These diagnostic methods have been primarily developed for weather forecasters to improve forecast of high-impact winter weather. The precipitation phase and types can also be explicitly derived from detailed microphysics schemes (e.g., Morrison & Milbrandt, 2015;Thompson et al., 2008) implemented in numerical weather prediction (NWP) systems or regional climate models (RCM) at convection-permitting scales (e.g., Liu et al., 2017;Milbrandt et al., 2016;Rasmussen et al., 2011;Seity et al., 2011). These schemes simulate the complex evolution of solid and liquid hydrometeors and associated precipitation processes as a function of atmospheric dynamic and thermodynamic processes.
Outputs from high-resolution atmospheric models are increasingly used to drive snowpack simulations in mountainous terrain in support of hydrological and avalanche hazard forecasting and climate predictions (e.g., Havens et al., 2019;He et al., 2019;Quéno et al., 2016;Vionnet et al., 2016Vionnet et al., , 2019. These driving data can often represent the complex interactions between topography and atmospheric flow better than sparse meteorological measurements (Lundquist et al., 2019). However, the benefit of using the precipitation phase derived from these atmospheric models for snowpack simulations have been rarely assessed. Tobin et al. (2012) suggested that the rain-snow line derived from a NWP system can improve snowmelt and runoff simulations in an alpine catchment in Switzerland. Wayand et al. (2016) and Currier et al. (2017) showed that the precipitation partitioning from a microphysics scheme can improve snowpack simulation in a warm coastal mountain range. Finally, Quéno et al. (2018) used the output of a microphysics scheme to simulate the formation of ice layers at the top of the snowpack due to freezing precipitation in the Pyrenees Mountains. In the context of weather forecasting, feedbacks from forecasters from Canadian and French national weather services suggest that expert forecasts of the snow-rain transition rely more on post-processed diagnostics of NWP models than on the raw phase discrimination from the microphysics of NWP models.
The objective of this study is to conduct a systematic assessment of the abilities and limitations of atmospheric-based PPMs for snowpack modeling in mountain ranges covering different climatic zones. The following questions are asked: (a) To what extent do atmospheric-based PPMs improve upon ground-based PPMs for snowfall estimation and snowpack prediction in mountainous terrain? (b) Among the atmospheric-based PPMs, what is the benefit of using post-processed precipitation phase compared to raw phase discrimination from the microphysics scheme? In this context, the Global Environmental Multiscale (GEM) hydrological modeling platform (GEM-Hydro) (Gaborit et al., 2017;Vionnet et al., 2020), which incorporates the detailed snowpack model Crocus (Brun et al., 1992;Lafaysse et al., 2017;Vionnet et al., 2012) was deployed at 2.5-km grid spacing over the mountains of southwestern Canada and northwestern US for the period September 2019-June 2020. The atmospheric forcing fields were taken from the High Resolution Deterministic Prediction System (HRDPS), a GEM-based atmospheric prediction system at a convection-permitting scale (2.5 km) (Milbrandt et al., 2016). Two atmospheric-based PPMs were considered from the HRDPS: the output from a detailed microphysics scheme (Morrison & Milbrandt, 2015) and a post-processing algorithm determining the snow level and the associated precipitation phase. Two ground-based PPMs were also included as benchmarks: a single air temperature threshold at 0°C used by default in GEM-Hydro (Bernier et al., 2011) and a PPM using wet-bulb temperature . Model simulations were evaluated against manual observations of the precipitation phase and automatic measurements of snow water equivalent (SWE) at the daily and seasonal scale.
The paper is organized as follows: Section 2 presents the study area, the different PPMs, the configuration of GEM-Hydro and the evaluation data and methods. Results are then detailed in Section 3. The impact of the different PPMs on snowfall fraction is first compared and results of phase prediction and snowpack simulations are then presented. Section 4 contains a discussion about the main results of this study. Finally, Section 5 summarizes the results and offers concluding remarks.

Study Area
This study focuses on a region covering the southern parts of the provinces of British Columbia (BC) and Alberta in Canada as well as the Washington State and parts of the US states of Idaho and Montana (Figure 1). The study domain covers an area of 1,200 km (West-East) × 1,050 km (North-South). Six subregions are considered to capture the change in climate and snow conditions from the coastal mountain ranges toward the interior ranges of the Canadian and US Rockies (Table 1). These subregions were divided based on watershed boundaries in the North American Environmental Atlas Basin Watersheds data set (Commission for Environmental Cooperation, 2011). Subregions were kept separately for Canada and the US to facilitate the analysis given different reference data sets. . Gray dots show the locations of the automatic stations measuring snow water equivalent and snow depth. Orange squares represent the location of the manual stations reporting precipitation phase. Areas in green and orange represent the six subregions used for analysis in this study (Table 1). 10.1029/2021WR031778 5 of 25

Snowpack Model
The detailed snowpack model Crocus (Brun et al., 1992;Lafaysse et al., 2017;Vionnet et al., 2012) was used to assess the impact of different PPMs on snowpack simulations over the region of interest. Crocus is a multilayer snowpack scheme that simulates the seasonal evolution of the physical properties of the snowpack and its vertical layering. Crocus simulates the evolution of the thickness, density, liquid water content, temperature, age and snow microstructure of each snow layer. This study relies on the version of Crocus that has recently been implemented as an additional option for snow simulations in the Soil, Vegetation and Snow version 2 (SVS-2) land surface scheme (Garnaud et al., 2019). SVS-2 is based on the SVS land surface scheme Husain et al., 2016;Leonardini et al., 2021) used at Environment and Climate Change Canada (ECCC) for hydrological forecasting within the GEM-Hydro modeling platform (Gaborit et al., 2017;Vionnet et al., 2020). Within SVS-2, Crocus is coupled to a multi-layer soil model (Boone et al., 2000). In this version of Crocus, the maximum number of simulated snow layers was fixed at 20 to test a viable configuration for an eventual operational implementation; such a configuration needs to balance accuracy and computational time. 50 layers are typically used in the context of operational avalanche hazard forecasting (Vionnet et al., 2012).
In this study, Crocus in GEM-Hydro was used in offline (uncoupled) mode to simulate the evolution of the snow cover. Snowpack simulations were carried out from 1 September 2019-30 June 2020 using atmospheric forcing described at the next section and without any assimilation of snow observations. They were performed over the study domain shown on Figure 1 at 2.5 km horizontal grid spacing. This grid corresponds to a subset of the grid used by the HRDPS that provides operational meteorological forecasts at 2.5-km grid spacing over Canada and the northern part of the US (Milbrandt et al., 2016). The elevation from the HRDPS grid was used to make sure that the precipitation phase estimates (Section 2.4) from HRDPS were valid at the elevation of the grid used by Crocus. The limitations in terms of representation of the terrain are discussed in Section 4.2. Soil properties (sand/clay fractions) were obtained from the Global Soil Data set for use in Earth System Models (Shangguan et al., 2014, GSDE) at 30 arcsec whereas information about the land cover (vegetation type/fraction) were derived from the ESA CCI LC global map at 300-m grid spacing (European Space Agency Climate Change Initiative Land Cover; http://www.esa-landcover-cci.org/). As snow observations are collected in open fields and forest clearings (Section 2.5.2), the interactions with the vegetation and the parameterization of fractional snow cover are not activated in the version of GEM-Hydro used in this study. The snow cover fraction is set to 1 whenever snow is present on the ground as in Vionnet et al. (2016). The initial soil state on 1 September 2019 was taken from a 1-year spin-up of the land surface conditions driven by HRDPS atmospheric forcing (see Section 2.3). This spin-up simulation used a single air temperature threshold set at 0°C to split between rain and snow (referred as method GrdT0 in Section 2.4).

Atmospheric Forcing
Crocus requires the following atmospheric forcing: air temperature and specific humidity at a known level above the surface, wind speed at a known level above the surface, surface incoming longwave and shortwave radiations and solid and liquid precipitation rates. These hourly fields were obtained from the HRDPS for the surface incoming radiations, air temperature and humidity and wind speed. Successive short-term HRDPS forecasts (6-11 hr lead time) of the 00, 06, 12, and 18 UTC were concatenated to generate a continuous atmospheric forcing. Similar methods have been used in previous studies focusing on distributed snowpack modeling in mountainous terrain Vionnet et al., 2019). Air temperature/specific humidity and wind speed were taken at the lowest prognostic thermodynamic and momentum levels, respectively, for air temperature/humidity (approximately 20 m AGL) and horizontal wind speed (approximately 40 m AGL). No downscaling and horizontal interpolation of the atmospheric forcing was performed since Crocus simulations were carried on the same grid as the HRDPS (same location of the grid points and same elevation).
Total precipitation was obtained from a 24-hr gridded precipitation analysis available at 2.5 km over Canada and the northern US. This analysis is generated daily by the Canadian Precipitation Analysis (CaPA) system (Fortin Name Area ( Note. The elevation is taken from the High Resolution Deterministic Prediction System grid at 2.5 km grid-spacing (see Section 2.3) and may differ from the real elevation. The locations of these subregions are shown on Figure 1.  Lespinas et al., 2015). CaPA combines precipitation observations (precipitation gauge, radar) with the background field from the HRDPS using optimal interpolation to produce a quantitative precipitation estimate on the HRDPS grid at 2.5 km. The 24-hr background field is obtained from four successive accumulated 6-12hr precipitation forecasts from the HRDPS. In this study, the 24-hr CaPA analysis was then disaggregated at an hourly time step by linearly rescaling the HRDPS hourly precipitation (successive hourly forecasts at 6-11hr lead time issued four times per day) to match the 24-hr CaPA accumulated precipitation. At grid points where HRDPS 24-hr precipitation was negligible but precipitation was present in CaPA, a constant precipitation rate was assumed for that time period.

Precipitation Partitioning Methods
Different PPMs were used to derive the liquid and solid precipitation rates from the total precipitation available in the atmospheric forcing (Section 2.3). Two types of PPMs were tested: (a) ground-based PPMs, which only rely on near-surface variables used as benchmarks and (b) atmospheric-based PPMs, which account for the upper-air temperature and humidity profiles simulated by the HRDPS. Table 2 gives an overview of the different PPMs that were tested.
Two ground-based PPMs of various complexity were considered since these methods are classically used in hydrological and land surface models (e.g., Harder & Pomeroy, 2013;Feiccabrino et al., 2015;Harpold, Kaplan, et al., 2017). These two PPMs were chosen to provide estimates of typical minimal and maximal performances of ground-based PPMs and thus provide lower and upper benchmarks against which to compare the performances of atmospheric-based PPMs, following the approach of Seibert et al. (2018). As in , the lower benchmark, referred to as GrdT0, consists of a PPM based on a single air temperature threshold set at 0°C. This temperature-only PPM is the default method used in GEM-Hydro (Bernier et al., 2011). It has known limitations since it overpredicts rainfall occurrence at air temperature slightly above 0°C and does not account for the effects of near-surface humidity on the precipitation phase (e.g., Marks et al., 2013). Therefore, a humidity-based PPM was used as upper benchmark. The scheme proposed by Wang et al. (2019) was selected since its has shown good performances across the contiguous US. This scheme, referred to as GrdTW, relies on the wet-bulb temperature, T w (in K), to compute the fraction of total precipitation falling as snow, f s : where the parameters a (6.99 × 10 −5 ), b (2.0 K −1 ) and c (3.97 K) have been optimized by Wang et al. (2019) to fit the relationship between observed snowfall fraction and observed T w reported in Behrangi et al. (2018). Note that the expression for f s given in Equation 1 differs from the erroneous expression given in Equation 1 of the paper by Wang et al. (2019).
Two atmospheric-based PPMs were considered. The first method uses the direct outputs of the Predicted Particle Properties (P3) microphysics scheme (Morrison & Milbrandt, 2015) implemented in the HRDPS. Within P3, the physical properties of solid particles (bulk density, etc.) are computed from the four prognostic ice variables (total ice mass, rime ice mass, rime volume and total number); then, the specific type of solid precipitation type is determined based a property-based decision tree. Freezing rain is identified in the model when liquid precipitation falls at subfreezing near-surface air temperature. P3 provides information on the hourly accumulation of four main precipitation types at the surface: rain, snow, ice pellets and freezing rain. These hourly values were Name Type Details

GrdT0
Ground-based Air temperature threshold at 0°C

GrdTW
Ground-based Snowfall fraction function of wet-bulb temperature  AtmMP Atmospheric-based P3 microphysics scheme (Morrison & Milbrandt, 2015) AtmSL Atmospheric-based Snow level derived from the Latent Heat Method (Appendix A) Table 2 Precipitation Partitioning Methods Used in GEM-Hydro extracted from the HRDPS archive for the same successive short-term forecasts (6-11 hr lead time) used to build the atmospheric forcing (Section 2.3). The hourly fraction of each precipitation type was then derived and added to the atmospheric forcing used by GEM-Hydro. Within GEM-Hydro, these fractions were combined with the hourly total precipitation amount to obtain the hourly amount of each precipitation type. Ice pellets and freezing rain were considered as frozen precipitation by Crocus, as the advanced module to account for ice layer formation due to freezing precipitation from Quéno et al. (2018) has not yet been merged in a stable and recent code version. The impact of this assumption is discussed in Section 4. This atmospheric-based PPM, relying on the microphysics scheme, is referred to as AtmMP in the rest of the text.
The second atmospheric-based PPM relies on the precipitation type derived from the Latent Heat Method (LHM) recently developed at ECCC in support of the forecasting of high-impact winter events. LHM is based on a post-processing of the HRDPS outputs to provide hourly forecasts of precipitation types. The method computes the height of the snow level defined here as the height above sea level where falling snow has completely melted. At each HRDPS grid point, the snow level is obtained from the vertical profile of wet-bulb temperature and hourly total precipitation amount simulated by HRDPS. This method accounts for a lowering of the snow level due to cooling by evaporation and fusion of snow falling through the melting layer (Kain et al., 2000;Minder & Kingsmill, 2013;Thériault et al., 2012). Once the snow level is known, the algorithm determines the precipitation type selected among six types: rain, snow, mixed precipitation (rain and snow), freezing rain, ice pellets and mix of ice pellets and freezing rain. More details about the LHM approach are given in Appendix A. As for the microphysics scheme, the hourly precipitation types were taken for the same successive HRDPS short-term forecasts (6-11 hr lead time) used to build the atmospheric forcing (Section 2.3). The hourly fractions of rain, snow, ice pellets and freezing rain were then determined (Table A1) and used in GEM-Hydro as described for AtmMP. This atmospheric-based PPM, relying on the snow level, is referred to as AtmSL in the rest of the text.
Four different atmospheric forcing data sets have been generated and used to drive distributed snowpack simulations with Crocus. These forcing data sets are the same, except for the separation between rain and snow (Table 2).

Precipitation Type
Hourly information on observed precipitation type and present weather were taken from the Aviation Routine Weather Reports (METARs) and the Aviation Selected Special Weather Reports (SPECIs) generated from manual observations (WMO, 2015). Six precipitation types were considered from the codes in the METAR and SPECI reports: drizzle, rain, snow, ice pellets, freezing drizzle and freezing rain. Reports from METARs and SPECIs can be issued at 5-min interval in between two full hours. For this reason, a 30-min time window was considered around a given full hour to determine at which hour an observation of precipitation type should be associated. For example, a snowfall occurrence reported at 14:15 UTC was associated with the HRDPS forecast valid at 14:00 UTC. The hourly observed air temperature was also extracted from the same stations. The observations were extracted from 1 September 2019-30 June 2020. Only the stations with elevations within ±200 m from the elevation of the 2.5 km grid were kept for the analysis to limit the errors in precipitation type that would result from elevation differences between the model and the observations. Out of the available stations, five stations were discarded. The locations of the 49 remaining stations are shown on Figure 1.
For each hour, precipitation types for the different PPMs were considered to be categorical. For the GrdT0 method, the classification between rain and snow was straightforward based on the 0°C air temperature threshold. The GrdTW method calculates the fraction of total precipitation falling as snow (Equation 1). This fraction was transformed into three categorical precipitation types-rain, mixed precipitation (rain and snow) and snow-using thresholds of 0.33 and 0.66 as in Casellas et al. (2021). For the AtmMP method, for a given hour, the precipi tation type with the largest hourly fraction was selected as the dominant precipitation type for this hour. If the hourly fractions of rain and snow were both larger than 0.33, the precipitation type for this hour was classified as mixed precipitation. Finally, the categorical precipitation types were already available for the AtmSL method.
The link between hourly predicted and observed precipitation type was then investigated using dichotomous contingency tables for snow and rain. The occurrence of mixed precipitation in the PPM estimate was considered as a predicted event for both rain and snow (see Appendix B). The evaluation was only carried out during periods with simultaneous observed and simulated precipitation. It was restricted to hours when (a) the observed near-surface air temperature was between −5°C and 5°C, which corresponds to the typical air temperature range where precipitation phases overlap (Casellas et al., 2021;Froidurot et al., 2014) and (b) the difference between the observed and the simulated near-surface air temperature was in the range (−1°C; 1°C) to limit the effect of erroneous air temperature forecast on the phase estimation. Snowfall that may occur at air temperature larger than 5°C in semi-arid areas was not considered in this evaluation (Thériault et al., 2018). Several thresholds (0.1, 0,2, 0.5, 1, 2 mm hr −1 ) were also considered for the hourly HRDPS precipitation to investigate if the ability of the PPMs to discriminate precipitation type depends on precipitation intensity. Four evaluation metrics were finally considered to evaluate model performance: the Probability of Detection (POD), the False Alarm Ratio (FAR), the Frequency Bias Index (FBI) and the Heidke Skill Score (HSS). POD measures the fraction of observed events that is predicted correctly whereas FAR is the fraction of predicted events that did not happen (false alarms). In addition, FBI relates the number of times an event is predicted with the number of times it was observed. Finally, HSS summarizes the results and measures fractional improvements over random chance. More details about these metrics are given in Appendix B.

Snow on the Ground
Daily co-located automatic measurements of SWE and snow depth were obtained from the CanSWE database in Canada (Vionnet, Mortimer, et al., 2021) and from the SNOTEL network in the US (Serreze et al., 1999). CanSWE includes automatic SWE and snow depth measurements from Alberta Environment and Park in Alberta and from the Ministry of Environment and its partners in BC. SWE is measured using snow pillows or snow scales whereas ultrasonic sensors are used to measure snow depth. Quality-Control (QC) procedures were applied to the snow data and include (a) range thresholding for SWE, snow depth and bulk snow density (define as the ratio between SWE and snow depth) and (b) automatic outliers detection using the robust Mahalanobis distance as in Hill et al. (2019). More details about the QC are provided in Vionnet, Mortimer, et al. (2021). Total precipitation data were also obtained for the SNOTEL stations. These stations are generally located in small forest clearings and suffer from a typical 5%-15% wind-induced undercatch of snowfall across the western US (Sun et al., 2019;Yang et al., 1998).
Times series of daily simulated SWE, snow depth and total precipitation were extracted for the different Crocus simulations at the locations of automatic snow stations. The grid point with the closest elevation to the actual station elevation was selected among the four nearest model grid points. Only stations with actual elevation within 200 m from the elevation of the selected grid point were kept for the analysis. This selection was applied to limit the impact on snow simulations of errors in precipitation phase and amount resulting from elevation differences between the model and the observations . It reduced by 26% the number of stations used for model evaluation (Table S1 in Supporting Information S1). Performance metrics for SWE and snow depth (Bias and Root Mean Square Error, RMSE) were considered separately for the snow accumulation and melting seasons since the sensitivity of snowpack simulations to PPMs depends on the season (Günther et al., 2019;Jennings & Molotch, 2019). At each station, the accumulation season was considered from the onset of observed continuous snow presence until the maximum observed SWE is reached whereas the snow melting season extended from the date of maximum observed SWE until the end of the continuous snow cover (Leonardini et al., 2021). At stations with shallow snowpack (maximal SWE lower than 200 mm) where the snow cover may not be continuous over the course of the winter, the accumulation and melting season were taken from the date of the first day with snow after 1 November until the date of maximum SWE and from the date of maximum SWE until the last day with snow presence, respectively. The impact of errors in total precipitation on Crocus performances was also considered for the SNOTEL stations. For each subregion in the US, biases and RMSE for SWE and snow depth were computed for a subset of stations where the simulated total precipitation is in sufficient agreement (±20%) with the observed total precipitation (including an expected 10% wind-induced undercatch at the SNOTEL sites). Table S1 in Supporting Information S1 shows the number of selected stations. Results for SWE simulations are discussed in the text (Section 3.3) whereas results for snow depth are provided in the Supporting Information S1.
Following Seibert et al. (2018), the performances of snow simulations generated with the atmospheric-based PPMs were compared to the performance benchmarks obtained using ground-based PPMs. A relative performance measure, R rel , was defined to allow the evaluation of the performance of a given atmospheric-based PPM relative to the performances of the upper and lower benchmarks (GrdTW and GrdT0): where Atm* refer to the atmospheric-based PPMs (AtmMP or AtmSL). R rel was computed for SWE and snow depth at stations where RMSE GrdT0 > RMSE GrdTW . R rel < 0 indicates worse performances than the lower benchmark whereas R rel > 1 shows improvements compared to the upper benchmark. Performances in between the two benchmarks are indicated by 0 ≤ R rel ≤ 1.
Daily changes in changes in SWE(ΔSWE) and snow depth(Δ SD) were also computed from the observed and simulated time series of SWE and snow depth. Quéno et al. (2016) have shown that considering daily changes in SWE and snow depth allows targeted evaluation of snow models focusing on specific processes (fresh snow accumulation, strong melting events, etc.). In the present study, daily changes in SWE were considered to investigate in detail two processes impacted by the choice of the PPMs: (a) daily snow accumulation; and (b) snow melt during strong precipitation events. Several ΔSWE thresholds were considered for daily snow accumulation (2, 5, 10, 25 mm) and two-by-two contingency tables were generated for each experiment. The HSS and FBI were then derived from these tables (Appendix B). Results for positive ΔSD are provided in the Supporting Information S1. The evaluation of simulated snow melt during precipitation events was restricted to the SNOTEL stations in the US where precipitation data are available. Strong precipitation events with snow melt were defined as days with negative ΔSWE and daily total precipitation larger than 10 mm following Musselman et al. (2018). Several thresholds were considered (−2, −10, −25, −50 mm) for ΔSWE and HSS and FBI were then derived for each experiment and threshold. Figure 2a shows the snowfall fraction obtained with the GrdTW method from 1 September 2019-30 June 2020.

Impact of PPMs on Snowfall Fraction
The snowfall fraction is defined as in  and represents the ratio between the total amount of precipitation falling as snow over a given period divided by the total amount of precipitation (snowfall + rainfall) over the same period. Low snowfall fractions are found in the Pacific coastal regions and in the associated mountain ranges (Vancouver Island in Canada, Olympic Mountains in the US; Figure 1) and in the Columbia basin of eastern US Washington state, lying between the Cascade and Rocky Mountains. High snowfall fractions (above 0.7) are found in the colder and higher-elevation Coast and Rocky Mountains in Canada and in the Cascade and Rocky Mountains in the US. In all subregions, the snowfall fraction estimated by the GrdTW method increases with elevation (Figures 3a-3f).
The spatial differences of snowfall fractions between the upper benchmark (GrdTW) and the other PPMs are shown on Figures 2b-2d and analyzed as a function of elevation on Figure 3. The GrdT0 method systematically reduced the snowfall fraction ( Figure 2b). The largest decreases in snowfall fraction (up to 0.18) were found in the elevation range 500-2,000 m for the different subregions considered in this study (Figures 3g-3l). Differences between the GrdTW and GrdT0 methods were lower for the regions of elevation higher than 2,000 m present across the domain since most of the precipitation falls with air temperatures below the freezing point in these regions. Atmospheric-based PPMs (AtmMP and AtmSL) provided estimates of snowfall fractions closer to the GrdTW method than the GrdT0 method (Figures 3g-3l). The AtmMP method estimated an average of 0.03 lower snowfall fraction in the elevation range 500-2,000 m. This average reduction is minimal (−0.01) in the Canadian Rocky Mountains (Figure 3g) and is maximal for the coastal US region (−0.03, Figure 3k). On the other hand, the PPM based on the estimation of the snow level (AtmSL) generated an average snowfall fraction 0.02 higher than the GrdTW method in the elevation range 500-2,000 m. The largest differences were found in the coastal regions (0.03 in coastal Canada and 0.04 in coastal US).

Evaluation of Precipitation Type
The four PPMs considered in this study were evaluated in terms of their ability to detect the occurrence of rainfall ( Figure 4) and snowfall ( Figure 5). The lower benchmark, GrdT0, generated a strong overestimation of rainfall occurrence (FBI larger than 1.15; Figure 4a) associated with numerous false alarms (Figure 4d). The GrdTW and AtmSL methods were the best performing methods in terms of HSS and FBI for the detection of rainfall for all the precipitation thresholds considered on Figure 4. In particular, they led to nearly unbiased estimation of rainfall occurrence (FBI close to 1), except for the higher thresholds (1 and 2 mm hr −1 ) where AtmSL underestimated the occurrence of rainfall. For the method relying on the microphysical scheme (AtmMP), the overestimation of rainfall slightly increases with increasing precipitation threshold. Its performances in terms of mean HSS lies in between the two benchmarks with an improvement of 10% compared to GrdT0 and a degradation of 6% compared to GrdTW. In terms of HSS, the performances of all PPMs to detect the occurrence of rainfall decrease with increasing precipitation threshold above 0.2 mm hr −1 (Figure 4b).
Results obtained for the detection of snowfall ( Figure 5) are consistent with those obtained for rainfall ( Figure 4). The GrdT0 method strongly underestimated the occurrence of snowfall (FBI lower 0.65 for all precipitation thresholds) and generated the lowest HSS among the four PPMs. As expected, the upper benchmark, GrdTW, greatly improved the detection of snowfall compared to GrdT0 (increase of 41% in mean HSS). The best HSS was obtained with the AtmSL method, that provided nearly constant HSS (around 0.85) for the different precipitation thresholds. In particular, AtmSL increased the mean HSS by 5% compared to GrdTW. This is mostly due to a better detection of snowfall for precipitation larger than 1 mm hr −1 (Figure 5c). AtmSL also provided nearly unbiased estimation of snowfall occurrence for precipitation lower than 0.5 mm hr −1 and led to a slightly overestimated occurrence for larger thresholds (Figure 5a). The three other PPMs showed continuously decreasing HSS with increasing precipitation threshold, mostly due to a decrease in POD with increasing threshold (Figure 5c). This suggests that the dependency of the height of the snow level to the precipitation intensity in AtmSL (Appendix A) contributes to the improvement of the snowfall detection with this method and can contribute to a slight overestimation of snowfall occurrence for the larger precipitation thresholds. The AtmMP method, relying on the microphysics scheme, generated less false alarms than AtmSL and GrdTW but it was also associated with less detection of snowfall occurrences and a FBI systematically lower than 0.8 The overall performances of AtmMP in terms of mean HSS were lower than AtmSL (−16%) and GrdTW (−12%) and larger than GrdT0 (+25%) (Figure 5b).

Impact on Snowpack Simulations
The four PPMs evaluated in the previous section were used in combination with other HRDPS atmospheric forcing to drive snowpack simulations with Crocus at 2.5-km grid spacing. The results of these experiments are evaluated below for the accumulation and melting seasons as defined in Section 2.5.2. The performances of the benchmarks are presented first followed by the analysis of the results of the two atmospheric-based PPMs.

Snow Accumulation Season
Model performances during the accumulation season are shown on Figure 6 for the different subregions considered in this study. For the subregions located in the US, the error metrics (bias and RMSE) were also derived for a subset of stations with close agreement between observed and simulated precipitation amounts. Such analysis aims at reducing the impact on model performances of compensatory errors between precipitation phase and amount. For all PPMs, this selection of stations improved the distributions of error metrics and reduced their spread, illustrating the impact of errors in precipitation amount on simulated SWE during the accumulation season. Therefore, in the rest of Section 3.3, model performances for the US subregions are analyzed using the subset of stations with close agreement between observed and simulated precipitation amounts.
In agreement with Figure 2, the sensitivity of the error metrics to the choice of the PPMs varies regionally. The largest sensitivity was found in the coastal regions in Canada and the US. In these regions, the upper benchmark (GrdTW) strongly reduced the errors in simulated SWE during the accumulation season compared to the lower benchmark (GrdT0). The decrease in median RMSE with GrdTW reached −15% in coastal Canada and −50% in coastal US. The interior regions in Canada and the US showed an overestimation of SWE during the accumulation season with both benchmarks. This SWE overestimation was associated with an overestimation of the occurrence of positive daily changes in SWE (ΔSWE) larger than 25 mm with both benchmarks (Figures 7h and 7k). This suggests that the overestimation was associated with the CaPA precipitation forcing rather than the choice of the PPM. Nonetheless, GrdTW improved the median RMSE compared to GrdT0 by 7% in interior Canada and 1% in interior US. In Canadian Rockies, the two benchmarks generally underestimated SWE during the accumulation season, in agreement with an increasing underestimation of positive ΔSWE with increasing ΔSWE threshold ( Figure 7i). GrdTW reduced this negative median bias by 26% compared to GrdT0. Finally, in the US Rockies, the upper benchmark led to nearly unbiased SWE simulation during the accumulation season and reduced the median RMSE by 2% compared to GrdT0. Overall, as expected, these results show that the upper benchmark relying on wet-bulb temperature generally improved snow simulations compared to the lower benchmark using a fixed air temperature threshold at 0°C.
The model's ability to simulate SWE using the precipitation phase taken from the microphysics scheme (AtmMP) was within the interval of performances set by the two benchmarks (GrdT0 and GrdTW), in agreement with the results obtained for the detection of snowfall and rainfall (Figures 4 and 5). This is illustrated by the distributions of the relative performance measure, R rel , mostly lying between 0 and 1 for all subregions. Compared to the lower benchmark, the AtmMP method improved SWE simulations during the accumulation season at 89% of the US and Canadian stations used to derive R rel . However, when compared to the upper benchmark (GrdTW), the AtmMP method improved SWE simulations at only 15% of these stations. In particular, the largest decrease in model performances compared to GrdTW is found for the coastal regions in Canada and the US (increase of median RMSE +14% in both subregions). In these regions, AtmMP underestimated the frequency of daily increase in SWE for all ΔSWE thresholds compared to GrdTW (Figures 7g and 7j).
For the snow accumulation season, the atmospheric-based PPM based on the snow level (AtmSL) led to results comparable to the upper benchmark (GrdTW). Indeed, the distribution of the relative performance measure, R rel , in all subregions is close to 1, the value that indicates the same level of performance as the upper benchmark. In fact, 59% of the stations in Canada and the US used to derive R rel show improvements in SWE simulations with AtmSL compared to GrdTW. The model ability to simulate SWE with AtmSL varies from one subregion to another. In coastal regions, AtmSL reduced the negative SWE bias compared to GrdTW but led to an increase in median RMSE in coastal Canada (+5%) and a slight decrease in coastal US (−1%). In the interior regions, AtmSL improved upon GrdTW with a decrease in median RMSE by −4% and −9% in interior Canada and US, respectively. Finally, the US Rockies show an increase of 15% in median RMSE. Overall, the median RMSE increased by 2% when considering all the subregions. Figure 8 shows the error metrics in SWE during the snow melting season as defined in Section 2.5.2. Model performances during the melting season result from (a) errors in peak SWE at the onset of the melting season due to errors during the accumulation season and (b) errors in the simulation of the snow melt dynamics. In agreement with the results obtained during the accumulation season, the upper benchmark (GrdTW) improved the upper end of the RMSE distributions (75th and 90th percentiles) in all subregions during the melting season compared to the lower benchmark (GrdT0). The median RMSE is also improved in all subregions with the exception of interior Canada and Canadian Rockies. In particular, GrdTW led to a noticeable improvement compared to GrdT0 in terms of snow melt dynamics during days with observed daily precipitation larger than 10 mm ( Figure 9). Indeed, the HSS in the US regions is systematically larger with GrdTW than GrdT0 (increase of mean HSS by +59%, +14%, +7% in coastal US, interior US and US Rockies, respectively). In particular, Figure 9d shows that the GrdT0 method strongly overestimated the occurrence of negative daily changes in SWE, ΔSWE, for low ΔSWE thresholds during days with precipitation. It is in agreement with the overestimation of rainfall occurrence with this method (Figure 4c) that resulted in erroneous occurrences of rain-on-snow events.

Snow Melting Season
The atmospheric-based PPMs showed contrasted performances during the snow melting season and confirmed the results obtained during the accumulation season. The method relying on the microphysics scheme (AtmMP) presented intermediate performances between the lower (GrdT0) and upper (GrdTW) benchmarks. The 10th and 90th percentiles of the distributions of the relative performance measure (R rel ) for AtmMP were between 0 and 1 for all subregions. Compared to the lower benchmark, AtmMP improved SWE simulations during the snow melt season at 95% of the stations in Canada and the US used to derive R rel . However, when compared to the upper benchmark, AtmMP led to improvements at only 6% of those stations. In contrast, the method AtmSL (based on the snow level) presented distributions of the relative performance measure with median values larger than 1 for all subregions, showing a general tendency to improve SWE simulations during the melt season compared to the upper benchmark. Overall, 76% of the stations in Canada and the US used to derive R rel showed improvements with AtmSL compared to GrdTW. The median RMSE across all stations was reduced by −9%. These improvements are larger in the coastal regions with a decrease in median RMSE by −10% and −11% in coastal Canada and US, respectively. Snow melt occurrence during precipitation events was systematically underestimated for all PPMs in the interior US and US Rockies (Figures 9e and 9f), suggesting that it can be associated with limitations in the atmospheric forcing and in the snowpack scheme as discussed in Section 4.4. Figure 6. Box plots for model biases (top), Root Mean Square Error (middle) and relative performance measure (R rel , bottom) in snow water equivalent for the automatic stations within each subregion (Figure 1) during the snow accumulation season (Section 2.5.2). Box plots for a subset of stations with close agreement between observed and simulated precipitation amount are also shown for each subregion in the US (Section 2.5.2) (see box plots with label Pcp. after the name of the subregion). These subsets of stations for the US subregions are used in the set "All." The box shows the median (middle bar) the interquantile range (25th-75th percentiles) and the 10th and 90th percentiles (whiskers). The numbers in italics show the numbers of stations in each subregion. R rel is only available for the atmospheric-based precipitation-phase partitioning methods (Equation 2 in Section 2.5.2). R rel < 0 indicates worse performances than the lower benchmark (GrdT0) whereas R rel > 1 shows improvements compared to the upper benchmark (GrdTW).

Comparison of Ground-Based and Atmospheric-Based PPMs
This paper has examined the ability of two atmospheric-based PPMs at predicting the precipitation phase in the mountainous regions of south-west Canada and north-west United States and has quantified their impact on snowpack simulations. Two ground-based PPMs of varying complexity were used as benchmarks following the approach of Seibert et al. (2018). The GrdT0 method relying on a fixed air temperature at 0°C was used to quantify the minimal performances that can be achieved by a PPM and thus provides a lower benchmark. As expected, this method produced poor estimation of the precipitation phase with a strong overestimation of rainfall occurrence and a strong underestimation of snowfall occurrence, in agreement with previous studies (Froidurot et al., 2014;Harder & Pomeroy, 2013;Jennings et al., 2018). It led to a strong underestimation of snow accumulation and an overestimation of snow melt during precipitation events in the coastal mountain ranges where the sensitivity of snowpack simulations to the choice of the PPMs is maximal. An upper benchmark was also used to represent the typical maximal performances that can be achieved by a ground-based PPM. A humidity-based PPM was selected as upper benchmark since such a method tends to produce the most accurate estimate of precipitation phase when using near-surface variables across large spatial scales (Harder & Pomeroy, 2013;Jennings et al., 2018). The humidity-based PPM from Wang et al. (2019) was selected as upper benchmark. It led to large improvements in the estimation of snowfall and rainfall occurrences and resulting snowpack simulations compared to the GrdT0 method. Other well established humidity-based PPMs (Harder & Pomeroy, 2013;Jennings et al., 2018) could have been used as upper benchmark.
Atmospheric-based PPMs present a priori a strong potential to improve phase estimate in complex terrain and resulting snowpack prediction since they include upper-air information on air temperature and humidity (Currier et al., 2017;Minder & Kingsmill, 2013;Wayand et al., 2016). However, this study shows that these improvements are not systematic. Indeed, the P3 microphysics scheme (Morrison & Milbrandt, 2015) used in the HRDPS decreased the quality of the estimation of the precipitation phase (−6% for the estimation of rainfall, −12% for the estimation of snowfall) compared to the upper benchmark (GrdTW) that only relies on near-surface wet bulb temperature. The microphysics scheme underestimated snowfall occurrence and overestimated rainfall occurrence. The performances of snowpack simulations in terms of SWE were also decreased, mainly in the coastal regions. These results are consistent with the absence of the modeling of the liquid fraction on mixed-phase particles in the version of P3 used in HRDPS (Cholette et al., 2019;Milbrandt et al., 2016). Without the representation of the liquid fraction, melted water formed during partial melting of solid hydrometeors is directly transferred to the rain category, instead of forming wet snow. A similar limitation is expected for other microphysics schemes since they treat melting of solid hydrometeors as P3 (e.g., Morrison et al., 2009;Thompson et al., 2008). Differences among the schemes may result from the treatment of ice growth and riming, affecting the fall speeds of solid hydrometeors, impacting the trajectory of precipitation, and in turn, the distribution of precipitation at the surface (Mo et al., 2019). Cholette et al. (2019) further developed the P3 scheme through the addition of a prognostic variable for the liquid fraction, thereby allowing the scheme to represent mixed-phase particles. It is expected that this version of P3 will be run operationally following a future upgrade of the HRDPS. It has the potential to improve phase estimate during winter storms when air temperatures in the lower atmospheric layers are near 0°C (Cholette et al., 2020). The results of this study highlight the need for the hydrology community to critically examine the quality of the precipitation phase from microphysics schemes before using it in snowpack schemes.
Among the four PPMs considered in this study, the best performances in terms of snowfall occurrence were obtained with the AtmSL method, an atmospheric-based PPM that derives the height of the snow level from the vertical profiles of wet-bulb temperature simulated by an atmospheric-model. This method has been initially developed to support operational mountain-weather forecasting in BC. Compared to the upper benchmark, the snow-level approach improved the detection of snowfall by 5%. These improvements in phase estimation affected positively the SWE simulations during the melting season with a decrease in median RMSE by 9% Figure 8. Same as Figure 6 for the snow melting season (Section 2.5.2). and improvements found at 76% of the stations used to derive the relative performance measure. During the accumulation season, the median RMSE increased by 2% and improvements were found at 59% of the stations used to derive the relative performance measure. These moderate improvements were found for the mountains of southwestern Canada and northwestern US and further evaluation are needed in other mountainous regions.
These results illustrate the potential benefit of the snow-level method for mountain snow hydrology. Tobin et al. (2012) showed similar results when using the rain-snow line from a NWP system in the Swiss Alps. These results also support the recommendations of Harpold, Kaplan, et al. (2017) to improve the link between atmospheric and hydrological models for a better determination of the precipitation phase. Atmospheric-based PPMs require additional data processing compared to more simple ground-based PPMs that only rely on near-surface variables to compute the precipitation phase. Results of this study show that a humidity-based PPM can achieve performances that are close to a more complex atmospheric-based PPM in mountainous terrain. The final selection of a PPM in a snowpack model is a trade-off between computational requirements and performances. The use of atmospheric-based PPMs will grow in the hydrology community if the snow level can be provided as a direct and standard output from atmospheric models without the need for the hydrologists to recompute it.
The AtmSL method has also the benefit of providing estimation of freezing rain and ice pellets accumulation that can be useful for snowpack applications (Quéno et al., 2018). In this study, freezing rain and ice pellets have been treated as solid precipitation as it is the case in all hydrological models using ground-based PPMs. For freezing rain, such assumption can lead to an overestimation of snow depth when it is treated as snow in the model instead of forming a dense ice layer at the surface of the snowpack. However, freezing rain occurrence is low in western Canada (McCray et al., 2019), which limits the impact of this assumption in the context of this study. Future application of the LHM method in eastern Canada where freezing rain is more frequent in winter (Mekis et al., 2020) will potentially require a specific treatment of freezing rain in Crocus as proposed by Quéno et al. (2018).

Limitations of Atmospheric-Based PPMs in Mountains
The two atmospheric-based PPMs used in this study are limited in mountainous terrain since their estimations of precipitation type (as a categorical variable) are only valid at the elevation of the atmospheric model grid cell. Therefore, to use them as input without introducing additional uncertainties due to downscaling (Ge et al., 2019), a snowpack model needs to run over the same grid as the atmospheric model. Crocus within GEM-Hydro used the same horizontal grid and orography as the HRDPS, despite the fact that the HRDPS orography is filtered and may not be the most accurate representation of the orography at 2.5-km grid spacing. As the horizontal grid spacing of atmospheric models used in mountainous terrains decreases, the errors in atmospheric-based PPMs due to elevation errors and differences will decrease (Vionnet et al., 2019). For example, ECCC is operating on a daily basis a version of the HRDPS at 1-km grid spacing over the mountains of southwestern Canada. Since the model is using the same physics as the pan-Canadian HRDPS at 2.5-km grid spacing, similar conclusions are expected for the snow level approach and the microphysics scheme in terms of estimation of the precipitation phase.
Nonetheless, the kilometer grid spacing of atmospheric models applied in mountains is still too coarse compared to the snowdrift-permitting scales needed to capture the small-scale variability of the snowpack in mountainous terrain (25-250 m) (e.g., Blöschl, 1999;Clark et al., 2011;Havens et al., 2019;Vionnet, Marsh, et al., 2021). At these resolutions, large differences of elevation can be found with the original grid of the atmospheric model. Meteorological downscaling methods are often used in snowpack models to take into account these elevation differences and to better reproduce the effect of the orography on the near surface meteorology. When using ground-based PPMs, the near-surface air temperature and humidity are first downscaled to account for elevation differences and the precipitation phase is then simply recomputed (e.g., Bernier et al., 2011). Downscaling can be applied to categorical variables (Ge et al., 2019) such as the precipitation type from an atmospheric-based PPMs. However, such approach has not been tested yet and may require strong assumptions to account for elevation differences. For the snow level, two additional downscaling approaches are possible. If the vertical profiles of air temperature and humidity are available for the lower atmospheric levels, they can be downscaled to the resolution of the snowpack model as proposed for example, by Haiden et al. (2011). These downscaled profiles can then be used as input to LHM to derive the snow level and the precipitation phase at the resolution needed by the snowpack model. This approach has been tested with success by the ECCC forecasters in BC but it is computationally expensive. If only the snow level is available at the resolution of the atmospheric model, it can be used directly to derive the precipitation phase for grid points of the snowpack model with elevation higher than the original atmospheric model as in Tobin et al. (2012). For grid points with an elevation lower than the elevation of the original atmospheric model, additional downscaling and adjustments of the precipitation phase are needed if the snow level is below the elevation of the original atmospheric model. The development of these adjustment methods will allow the use of atmospheric-based PPMs in snowpack models running at snowdrift-permitting scales.

Spatial Variability in Precipitation Phase
The sensitivity of the snowfall fraction to the choice of the PPM varied across the study domain (Figures 2 and 3). It was most sensitive in the coastal mountain ranges in the elevation range 500-2,000 m. Reduced sensitivity was found in the colder interior mountain ranges and at higher elevation. The sensitivity was also reduced at the lowest elevations of some regions where rain dominates, for example, below 500 m in the coastal regions. The snowpack simulations showed similar sensitivity and the largest changes in SWE performance metrics among the different PPMs was found in the coastal regions (Figures 6 and 8). Jennings and Molotch (2019) found similar sensitivity between warm maritime sites and cold continental sites in the mountain ranges of the western US. Spatial variability was also found in the difference of snowfall fractions between the PPMs. For example, the AtmSL and GrdTW methods presented repeated spatial patterns of differences driven by the orography of the different mountain ranges of the study domain (Figure 2d). The limited number of evaluation stations used in this study restricted the ability to evaluate the physical realism of this local variability in the rain-snow transition, which requires further investigations. In addition, these stations are mostly located at low elevation compared to the surrounding mountains (Figure 1), highlighting the importance of indirect validation carried out with stations measuring snow on the ground located at higher elevation. Further evaluation is required to extend spatially the comparison of atmospheric-based PPMs and ground-based PPMs, in particular in eastern Canada where precipitation with near-0°C surface conditions is frequent (Mekis et al., 2020).

Uncertainties in Snowpack Simulations
The different PPMs led to contrasting snowpack simulations with Crocus, especially in the coastal regions. Errors in the snowpack simulations can result from errors in the estimation of the precipitation phase but also from errors in the rest of the atmospheric forcing and in uncertainties in the snowpack model (Günther et al., 2019;Lafaysse et al., 2017;Raleigh et al., 2015). In particular, the total precipitation from the CaPA analysis presents limitations in the mountains of western Canada (Fortin et al., 2018;Schirmer & Jamieson, 2015). When possible, the analysis of model performances was restricted to stations with a close agreement between observed and CaPA precipitation amount. However, it was only possible for the SNOTEL stations in the US which limits the quality of the evaluation in the Canadian regions. In addition, the wind-induced undercatch value of 10% assumed for the observed precipitation may be underestimated at some stations located in the Rocky Mountains and the interior Columbia basin as reported in Sun et al. (2019). The Crocus simulations remained also affected by errors in the near-surface air temperature and humidity and in the incoming shortwave and longwave radiation. In particular, a bias in the HRDPS incoming longwave radiation may explain the underestimation of snowmelt during precipitation events found in the interior regions and in the Rocky Mountains Raleigh et al., 2016). Further evaluation of the HRDPS meteorological forcing is required in mountainous terrain. The surrounding forest may also modify the incoming radiation for snow pillows located in forest clearings (Musselman et al., 2015). Finally, uncertainties in the parameterizations used in Crocus also affected the snowpack simulations. Ensemble snowpack simulations using the multiphysics version of Crocus (Lafaysse et al., 2017) could be used to confirm the significance of the reported differences between PPM against the uncertainties of the snowpack model. All these uncertainties in the atmospheric forcing and in the snowpack model can potentially explain why the upper benchmark did not lead to systematic improvements of snowpack simulations at each station compared to the lower benchmark, despite improved phase estimates. As a result, the number of stations used to derive the relative performance measure was reduced compared to the total number of stations available to derive the absolute performance metrics (RMSE and Bias; Figures 6 and 8).

Conclusion
This study aims to quantify the benefit of atmospheric-based PPMs for snowfall estimation and snowpack prediction in mountainous terrain. Toward this goal, a configuration of the GEM-Hydro modeling platform including the detailed snowpack model Crocus was used to simulate snowpack evolution over the mountains of southwestern Canada and northwestern US at 2.5-km grid spacing. Atmospheric forcing fields for Crocus were taken from the Canadian HRDPS at 2.5 km horizontal grid spacing. Total precipitation was taken from the CaPA. Two atmospheric-based PPMs were considered from the HRDPS: (a) the direct output from the P3 microphysics scheme and (b) a post-processing algorithm determining the snow level and the associated precipitation phase from the vertical profile of wet-bulb temperature (method AtmSL). Two ground-based PPMs were also included as benchmarks: a single air temperature threshold at 0°C (lower benchmark) and a PPM using wet-bulb temperature from Wang et al. (2019) (method GrdTW, upper benchmark). Simulation results were evaluated against manual measurements of precipitation phase and automatic measurement of SWE. Relative performance measures were computed for the two atmospheric-based PPMs using the benchmarks as references.
The main conclusions of the study are as follows: • Across the study region, the snowfall fraction was most sensitive to the choice of the PPMs in the coastal regions in the range 500-2,000 m. Reduced sensitivity was found in the colder interior mountain ranges. • As expected, among the ground-based PPMs, the upper benchmark including the effect of humidity on the precipitation phase led to systematic and large improvements of precipitation phase estimations and SWE predictions compared to the 0°C PPM. • Compared to the upper benchmark, the P3 microphysics scheme decreased the performances in phase predictions (relative decrease in HSS of −12% for snowfall estimate). During the snow accumulation (melting) season, it only improved SWE simulations at 15% (6%) of the stations used to derive the relative performance measure. The missing representation of the liquid fraction on mixed-phase particles in P3 may explain this result. These findings motivate the need for a systematic evaluation of the precipitation phase derived from microphysics schemes. Despite their physical realism, these schemes still present limitations for the estimation of the precipitation phase. • Using the snow level from post-processing of the atmospheric model improved the estimation of precipitation phase compared to the upper benchmark (relative improvement in HSS of +5% for snowfall estimate). SWE simulations were improved during the snow melting season (relative decrease in median RMSE by −9% and improvements found at 76% of the stations used to derive the relative performance measure). This illustrates the value of including upper-air information on air temperature and humidity when deriving the precipitation phase in mountainous terrain.
The results demonstrate that the precipitation phase derived from the snow level of an atmospheric model presents an interesting potential to improve snowfall estimate and snowpack prediction in the mountains of western Canada and US. Further work is required to (a) allow the use of the snow level in the context of high-resolution snowpack simulations (50-250 m) capable of capturing the spatial variability of the snow cover in mountainous terrain (e.g., Vionnet, Marsh, et al., 2021) and (b) extend the evaluation of atmospheric-based PPMs to other mountainous regions.

Appendix A: The Latent Heat Method
The Latent Heat Method (LHM) has been developed at Environment and Climate Change Canada to provide forecasts of the snow level and corresponding precipitation types. This post-processing method uses a top-to-bottom approach to derive the elevation of the snow level defined as the height (m) of the lowest level in the melting layer where snow is still present (Minder & Kingsmill, 2013). The methods operates at an hourly time step at each model grid cell and considers the vertical profile of wet-bulb temperature, T w . It uses the full extent of vertical grid of the operational High Resolution Deterministic Prediction System which consisted at that time of 62 levels with eight of them in the lowest 1,000 m of the atmosphere (Milbrandt et al., 2016). Latent Heat Method assumes that snow starts melting at the first model level from the top (identified here with index N) with T w > T f (T f = 273.15 K). Latent heat exchanges in the melting layer below are then taken into account to derive the snow level following Kain et al. (2000). Falling snow is considered fully melted at level M when the total energy gained through the different layers (from N to M) from melting (dq m ) and adjusted from condensational warming to avoid supersaturation (dq c ) reaches the total energy required to melt the hourly accumulated precipitation (Q m ). At level M: where Q tot = L f D with L f the latent heat of fusion (J kg −1 ) and D the hourly accumulated total precipitation (kg m −2 ). The melting energy for level k is given by where c p is the air specific heat capacity (J K −1 kg −1 ), Z is the elevation of level k (m), T w,avg the average wet bulb temperature (K) (T w,avg = (T w,k+1 + T w,k )/2) and ρ a,avg the average air density (kg m −3 ). The condensation energy for level k, dq c,k , is written as: where L c is the latent heat of condensation (J kg −1 ). d rs represents the change in the mixing ratio (-) of moist air due to condensation: where e v,avg is the average pressure of water vapor (Pa), es(T) the saturation pressure of water vapor (Pa) at air temperature T and p avg the average air pressure (Pa). The air density at level k is calculated as: where p k is the air pressure at level k (Pa), T a,k the air temperature at level k (K), R the universal gas constant (J K −1 mol −1 ) and M d and M v the molar mass (kg mol −1 ) of dry air and water vapor, respectively.
The snow level is defined as the elevation of level M. Incorporating the precipitation rate in the computation of the snow level allows LHM to capture the lowering of the snow level associated with intense precipitation (Minder & Kingsmill, 2013). If the entire atmosphere is at T w < 0°C, a value of −2 is assigned to the snow level whereas a value of −1 is assigned if the snow is partially melted (level N above the surface and level M not reached before the ground). The precipitation type is then derived from (a) the snow level, (b) the height above the surface of level N, (c) the near-surface air and wet-bulb temperatures and (d) the negative refreezing energy in the layer below the snow level as in Bourgouin (2000). A total of 6 types is considered (

Appendix B: Scores Obtained From Contingency Tables
Dichotomous contingency tables (Table B1) were used to evaluate the ability of the different PPMs to determine the precipitation type (Figures 4 and 5) and to reproduce the daily changes in SWE (Figures 7 and 9). Four evaluation metrics were used. Using the cell counts defined in Table B1, Range of POD and FAR are 1 to 0. The perfect scores for POD and FAR are 1 and 0, respectively. FBI relates the number of time an event is predicted with the number of time it was observed. A ratio of one indicates an unbiased forecast. Finally, the HSS is a commonly used metric for summarizing dichotomous contingency table (Nurmi, 2003). HSS ranges between −1 and 1, with 1 for a perfect forecast, 0 for a random forecast and negative values for a forecast worse than random forecast. Note. The letters correspond to the total number of occurrences matching a specific combination.