Modelling COVID-19 mortality at the regional level in Italy

Italy was harshly hit by COVID-19, registering more than 35,000 deaths between February and July, 2020. The virus spread unequally across the country, with northern regions witnessing more cases and deaths than those in the centre and south. We investigate demographic and socio-economic factors that contributed to the diverse regional impact of the virus in Italy. Within a smoothing framework, we divide regions into three well-defined groups of High, Middle and Low mortality by cluster analysis. Extending the Poisson regression model to account for regional clusters, we find that COVID-mortality is positively associated with the share of ICU utilization, GDP per capita, proportion of the older population and the number of COVID-19 positive cases, while it is negatively associated with the delay of region-specific outbreaks and the number of tests performed. Our results have relevant policy implications for potential resurgence of COVID-19 infections in Italy and across the world.


Introduction
The novel coronavirus disease  is an infectious disease that has rapidly spread globally since the beginning of 2020. The disease is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and it was firstly identified in the city of Wuhan in December 2019 (Du et al., 2020). Within a matter of weeks, the World Health Organization declared the COVID-19 outbreak a public health emergency of international concern (January 30, 2020), and then a pandemic (March 11, World Health Organization, 2020b).
The global number of reported cases of COVID-19 has been rising at a very fast pace during 2020, increasing from about 10 thousands at the beginning of February to more than 17 millions at the beginning of August. Similarly, the number of deaths attributed to the disease has increased from around 250 to more than 675 thousands during the same period (Johns Hopkins University CSSE, 2020;World Health Organization, 2020a).
In Italy, the first case of COVID-19 was confirmed on February 20, although the virus was already present in the country since January (Cereda et al., 2020). Since the identification of "patient one", the country has been harshly hit by the spread of the virus, with regard to both infections and deaths. The number of reported cases exceeded that of China on March 27, totalling almost 250 thousand cases by the end of July. In addition, Italy registered more COVID-19 deaths than any other country between March 19 and April 7, surpassing 35 thousand deaths at the beginning of August (data from Johns Hopkins University CSSE, 2020).
In response to the rapid spread of the virus, the country adopted a series of measures to slow down its transmission. On February 23, eleven municipalities in the north of Italy were identified as the main cluster of the epidemic and put under quarantine. Simultaneously, six northern regions implemented different restrictions, ranging from school closures to cancellations of public, religious or sport events. On March 1, the Council of Ministers divided the country in three areas of risk: a red zone, comprising the eleven municipalities in quarantine; a yellow zone, comprising the regions of Lombardia, Veneto and Emilia-Romagna, where public and sport events were suspended and school closed; and the rest of the country, with softer safety and preventive measures. Then, schools were closed nationwide on March 4, the entire country was put in lockdown on March 9 and non-essential activities were suspended on March 23 (Ministero Della Giustizia, 2020). Unfortunately, all these measures only partially mitigated the diffusion of the virus.
Since the early phases of the Italian outbreak, regional differences emerged in terms of timing and magnitude of the virus' diffusion. At first, the virus spread in the north of Italy: the regions in the yellow zone (Lombardia, Veneto and Emilia-Romagna) were the first to confirm more than 200 cumulative infections at the start of March. Southern regions, which were initially rather unaffected, witnessed increasing rates of contagions throughout the months of March and April. In terms of COVID-related deaths, similar regional differences occurred, with northern regions generally more affected than southern ones (cf. Figure 4 in the Results section).
Economic and demographic differences across Italian regions have been the subject of a vast body of literature (see, e.g., Helliwell and Putnam, 1995;Billari and Ongaro, 1998;Di Giulio and Rosina, 2007;Kertzer et al., 2008). However, little is known about the effects of such differences in shaping the COVID-19 epidemic in Italy. In this paper, we analyse the demographic and socio-economic factors that contributed to the diverse regional impact of the virus in Italy. We focus on COVID-related mortality during the first half of 2020, specifically from the end of February to mid-July, a period during which about 35 thousands COVID-19 deaths were registered. We study the association between the reported number of deaths by COVID-19 and a set of explanatory variables at the regional level. We investigate the main explanatory variables that have been linked to the outbreak of COVID-19 across the world. By employing a Poisson regression model, we determine the demographic and economic variables that had the strongest impact on the number of deceased individuals. Finally, to take into account the different territorial experiences, we stratify our analysis by clustering regions that shared similar trajectories of the epidemic.
This article is organized as follows. Section 2 describes the methodology that we use for our analysis, the mortality data that we employ and the relevant covariates that we consider in our study. In Section 3, we illustrate the results of our analysis, and we conclude with a discussion of our findings in Section 4.

Mortality data and modelling
Since February 24, 2020, the Dipartimento Della Protezione Civile (2020) publishes the cumulative number of reported COVID-19 deaths for each of the 21 NUTS-2 Italian regions on a daily basis. Rather than the cumulative counts, we are interested in the new daily number of deaths, so we start our analysis from February 25. Let m be the number of days available in the dataset, and n be the number of regions. Let us denote by D = (d t,r ) the m × n matrix containing the observed deaths at time t = 1, . . . , m in region r = 1, . . . , n. Time t is a natural number between February 25 and July 15. Observed deaths d t,r are assumed to be realizations from a Poisson distribution (Brillinger, 1986) with meanȇ r µ t,r , whereȇ r denotes the population exposures to the risk of death for each region r and it is assumed fixed over the period under study. The vector µ t,r denotes the force of mortality at time t for each region r and its estimation is the object of the proposed model.
As mentioned, we first perform a preliminary step by classifying Italian regions into clusters that shared similar trajectories of the epidemic (see Section 2.3). This analysis will help to better disentangle the effects of explanatory variables as drivers of additional differences between regions. We then model our Poisson death counts via a log-link function in a Generalized Additive Model framework employing both linear and nonlinear effects. Specifically over time t, for region r belonging to cluster c, the logarithm of the force of mortality is given by where η 0 t is fixed for each region and it represents the common epidemic dynamic over time. Cluster-specific variations over time with respect to the common time-trend are described by γ c t . For each region r, the design matrix Z t,r contains the values of the explanatory variables that could eventually change over time t. The coefficients vector β is common over regions and it can be interpreted as in a classic regression setting. Note that, without loss of generality, we consider the first cluster as reference, i.e. estimated log-mortality for regions in this group is given by the simple sum of η 0 t and associated Z t,r β.
We assume smoothness for both common time-trend and cluster-specific distances. Following a P -spline approach, we model these function as a linear combination of Bsplines and associated coefficients which are penalized by discrete penalties (Eilers and Marx, 1996). Estimation procedure has been implemented in R (R Development Core Team, 2020) and the code can be obtained in the following blind repository: https: //osf.io/mu7hy/?view_only=7c03829c4c8a4859a37038581e1ce6fd. Additional statistical aspects of the model are provided in Appendix A.

Explanatory variables
Here we describe the explanatory variables that we employ in our regression setting, we provide their sources and some descriptive statistics.
We consider two types of covariates in our analysis: (1) "constant" variables that do not change over the time frame analysed; (2) a set of time-varying variables, which are allowed to change on a daily basis throughout the observation period. We start by introducing the first type of covariates.
Exposures data, i.e. the total population for each region at the start of the year 2019 (the last available date), is retrieved from Istat (2020) and they will serve as offset in the regression setting. Moreover, the same source provides data on the share of population aged 65 years or more, the population density, the average size of the households, and the gross domestic product (GDP) by regions. Furthermore, we obtain the regional number of nursing homes from the Istituto Superiore di Sanità (2020). Finally, to account for the fact that regions experienced outbreaks at different time points, we compute a proxy for the delay of the epidemic starting date. This is derived as the number of days (since February 25) until cumulative cases surpassed 0.0001% of the regional population. For example, this regional-specific threshold is 10 in Lombardia, 4 in Piemonte and 2 in Calabria. This allows us to consider the population size in the timing of the epidemic outbreak across regions, and the choice of the threshold (0.0001%) does not modify outcomes of the following analysis.
Furthermore, we consider four time-varying covariates. The first is time, defined as the number of days since the first available date in the dataset. The other covariates are the cumulative number of tests performed, the total amount of current positive COVID-19 cases (including both hospitalised patients and home confinement), and the percentage of intensive care units (ICU) utilization. The first two variables are provided by the Dipartimento Della Protezione Civile (2020). The last variable is computed by dividing the daily number of patients currently in ICU (retrieved from Dipartimento Della Protezione Civile, 2020) by the total ICU availability in 2019 (the last available date). Regional data on the number of ICU available in 2019 is obtained from the Ministero della Salute (2020). It should be noted that, in some regions, the percentage utilization of ICU exceeds 100%, because the denominator does not take into account the large additions to the stock of ICU in response to the outbreak of COVID-19 during the first months of 2020.
To summarize, Table 1 reports the explanatory variables that we use in our analysis, together with their characterization (constant or time-varying), the transformations that we perform to some variables, and the descriptive statistics of the transformed variables. Figures B.1 and B.2 in Appendix B provide an exploratory analysis of these variables and COVID-related mortality. This analysis allows us also to identify the appropriate transformations of the covariates used in the following.

Regional clustering
The variability displayed by COVID-mortality data at the regional level is very high, and a simple regression analysis with the covariates described in Subsection 2.2 is not powerful enough to provide an adequate fit to the data. We thus opt for a preliminary step in which we group Italian regions that have experienced a similar unfold of the pandemic. This allows us to incorporate in Eq. (1) cluster-specific distances with respect to the general time-trends, γ c t ; these distances describe changes over t devoid of the effects of the explanatory variable in Z t,r .  (2020); Dipartimento Della Protezione Civile (2020); Istat (2020); Istituto Superiore di Sanità (2020) Our aim is thus to classify regions according to their COVID-mortality time-trend. In practice, we first describe each regional mortality pattern in a non-parametric framework: where B is a matrix of k = 31 B-splines, which are common for all regions. Regionalspecific coefficients α r are estimated in a P -spline setting (Camarda, 2012). This approach reduces both randomness and dimensionality for our classification problem from m = 142, number of available days, to k, length of the vector α r . The estimated coefficients A = [α 1 , . . . , α r , . . . , α n ] contain all relevant features of the observed regional mortality trends, and our aim is to classify them, and consequently classify µ t,r . Several options are available when a cluster analysis is performed. In this study we opt for a k-means clustering approach, which partitions the n vectors in A into s sets so as to minimize the within-cluster sum of squares (Hartigan and Wong, 1979). This approach allows us to extract the principal components of the correlation structure of A. The investigation of these components helps in interpreting the outcomes of the cluster analysis and consequently underpin the choice of s on explicit conjectures. The R-routine kmeans() was used to perform the clustering on our matrix A (R Development Core Team, 2020).
The main problem in any classification approach concerns the choice of s, the number of clusters. Various criteria are available for optimizing the number of clusters and they have shown to produce diverse outcomes in our analysis. In the following we opt for a choice that balances interpretation of the produced clusters via the associated principal components and robustness of the outcomes from the regression analysis presented in Section 2.1. In other words, we select s such that the interpretation of both partitions and principal components of A is clear, and estimated β in Eq. (1) remain practically unchanged with respect to similar choices of s. Given this approach, we select s = 3 sets of regions.
Finally the k-means approach allows us to extract "centers" of each cluster providing outcomes which are interesting by themselves. These "centers" could serve as prototypes for identifying regional evolution of the pandemic, and they could be used to identify specific patterns regardless the effects of the explanatory variables included in the regression analysis.

Results
The first step of our model consists in dividing the 21 Italian regions into three clusters. Figure 1 shows the results of this procedure. The first two principal components (PCs) of the smooth COVID-19 mortality pattern capture 93.1% of the overall variability found in the P -splines coefficients. Whereas the left panel depicts the partitioning of the regions into three well-separated clusters, the right panels of Figure 1 show the first two principal components.
We labeled the three clusters High, Medium and Low mortality since the first PC captures a large part of the overall variability (76.3%), and it describes the average shape of the epidemic over time. Consequently, the values associated to this component describe the regional mortality levels and, with the exception of Toscana, the clustering is based on the different level of mortality. However, the following regression analysis will identify additional features in the cluster-specific mortality patterns.
Moreover, the second principal component (see bottom-right panel in Figure 1) represents how the main pattern deviates over time and, given its essential linearity, it indicates that regions with higher (lower) associated PC value have a relatively higher (lower) mortality in the first period with respect to the second.
The smooth mortality pattern by regions as well as the cluster centers are shown in Figure C.1 of Appendix C. Figure 2 shows the map of Italy, with regions coloured according to their corresponding cluster. Moreover, colour transparency is proportional to the regional values associated to the first principal component, with higher transparency corresponding to higher values of the component, and hence higher mortality level. As expected, the clustering division broadly follows the territorial positions of the regions, with Northern regions belonging to the High mortality clusters, Centre regions to the Medium one, and Southern regions to the Low one. However, some exceptions are observable, such as P.A. Trento and Bolzano, Friuli-Venezia Giulia, Abruzzo and Puglia. Regions at the border of the transparency-opacity gradient identify boundaries of the clustering divisions in terms of the first principal component. For example, the high transparency of Toscana and Abruzzo indicate that they are closer to the Medium cluster than regions with high opacity such as Lombardia and Liguria.
Next, we run our model on the covariates described in Subsection 2.2. To select the covariates that comprise our final model, we employ a bottom-up, or forward selection approach. Starting from a very parsimonious model, which includes only the cluster effects, we add covariates once at a time, and we retain them only when they are statistically significant (using the Wald test statistic). The rationale behind the bottom-up approach instead of a top-down (backward elimination) one is that some covariates are highly correlated between each other, so that employing a very large model can introduce collinearity in the regression setting. However, different selection procedures have been tested and lead to the same final model.   Table 2 show the results of our model. The figure displays the estimated reference and cluster-specific curves over time with 95% confidence intervals.
The reference curve, η 0 t , describes the average time-profile of the epidemic in the High mortality cluster with a rapid log-mortality increase up to mid-March, followed by a relatively stable pattern and a late downward trend from early June. It is important to also highlight two relatively small mortality increases during April and the end of May. The former increase is particularly puzzling as occurring while the entire country was in lockdown.
In addition, the right panel of Figure 3 presents the cluster-specific terms, γ c t , that directly act on the reference curve to modify both its level and curvature. Both curves have inverse-U shapes with more prominent edges in the Low cluster, i.e. whereas all Italian regions have shared a similar log-mortality pattern from the end of March to the end of April, regions in the Medium and Low clusters have lower mortality at the beginning and toward the end of the whole period. Moreover, the Low cluster dramatically approached log-mortality level in the High cluster around the end of March, but those regions simultaneously departed from the highest mortality level immediately after mid-April, and at a faster pace with respect to the regions in the Medium cluster.
Six variables are retained as statistically significant in the model selection: (i) the share of ICU utilization, (ii) the delay in the start of the epidemic, (iii) the cumulative number of tests performed, (iv) the GDP per capita, (v) the share of population aged 65 years or more, and (vi) the number of positive COVID-19 cases.
The signs of the estimated coefficients are in line with our expectations. The share of ICU utilization, the level of GDP per capita, the share of older population and the . Colour transparency is proportional to regional value associated to the first principal component.  Figure 3. Estimated smooth reference and cluster-specific curves with 95% confidence intervals from the Poisson regression model on COVID-related mortality and the six explanatory variables in Table 2.
different coefficients, we normalize each variable by mean-centering and scaling to one standard deviation (1SD). As such, the estimated coefficients can be considered as the change in the log-mortality corresponding to a 1SD increase in the variable. The strongest effect on mortality is found for the (squared root of the) share of ICU utilization, with a 1SD (3.4 percentage points, cf. Table 1) increase associated with a 0.92 increase in the log-mortality. The lowest effects are observed for the share of the older population and the number of positive cases, with coefficients equal to 0.09 and 0.08, respectively. It is worth noticing that the remaining covariates introduced in Subsection 2.2, namely population density, household size and the number of nursing homes, were not retained in our model. For population density and the number of nursing homes, the estimated coefficients were not statistically significant, leading to an increase in the BIC as compared to our selected model. For the average size of the household, issues of collinearity arose in the model (altering the signs of other estimated coefficients). This effect was mainly due to the high correlation between household size and two other variables in our model (GDP per capita and share of the older population, cf. Fig. B.2 in Appendix B).
The estimated model allows us to compare the observed and fitted regional evolution of the pandemic. Figures 4 show the observed and fitted COVID-19 death counts with 95% confidence intervals in each region as well as for the overall country (top-left panel). The model captures the data well, although fitted values tend to underestimate the data in Emilia-Romagna and can only partially capture the relatively different shape of the curve observed in Veneto. For completeness, in the Appendix C, we report a similar plot for log-mortality as well as the model's deviance residuals for a model diagnostic.

Discussion
The research community has been very responsive to analyse the spread of COVID-19 across the globe. For the purpose of this article, we focus on the relevant literature for the Italian context.
Early efforts have been directed towards monitoring the virus's spread at the national or subnational level using Poisson models. Chiogna and Gaetan (2020) proposed a dynamic generalized linear model for the Poisson distribution of new and total cases at the national, regional and provincial level. The analysis of the time-varying slope of the local linear trend allows them to detect changes in the underlying process in terms of acceleration, deceleration or stabilization of the diffusion of the disease. Moreover, Bonetti and Basellini (2020) introduced a tool for visualizing the spread of COVID-19 in Italian provinces and regions by modelling the total number of cases with Poisson regression and parametric hazard functions. Furthermore, Agosto et al. (2020) proposed a Poisson autoregression on the daily number of cases, and compared the Italian context with China and other European countries.
All these works consider the spread of the virus in different territories in isolation, i.e. each region or province is analysed independently without taking into account their correlations. However, regional contexts have played a relevant role in the unfolding of the Italian pandemic. As such, in this paper we have introduced a methodology that considers the country as the sum of different regional experiences. Our regression model allows us to identify the most significant variables that have contributed to the greater or lower burden of deaths across regional units from the end of February until mid July 2020. In particular, we find that the percentage of ICU utilization, the GDP per capita, the share of the older population and the number of positive COVID-19 cases are positively associated with COVID-19 mortality. Conversely, the delay in the start of the epidemic and the cumulative number of tests performed are negatively associated with mortality.
Few attempts have been made to take into account the cross-regional dependence in the spread of COVID-19 in Italy. Maltagliati (2020) was among the firsts to suggest analysing the Italian epidemic as the sum of region-specific outbreaks. The author describes the cumulative number of deaths using a logistic model that considers the delay in the start of the regional outbreaks. The proposed model produces large differences in the regional-specific asymptotes, and the author argues that the regional perspective is fundamental to understand the evolution of COVID-19 in Italy. Furthermore, Boschi et al. (2020) employ Functional Data Analysis techniques to investigate the association of COVID-19 mortality with mobility, positivity and other covariates at the regional level from mid-February to April. The authors document the outbreak of two starkly different epidemic types: an exponential one in the worst hit areas in the north of the country, and a flat one in the remaining regions.
Our analysis shares commonalities with these two studies, but they substantially differ in a number of ways. First, the parametric model proposed by Maltagliati (2020) does not consider the role of explanatory variables in shaping the effects of the COVID-19 outbreak across the regions. Second, while the semi-parametric analysis of Boschi et al. (2020) is conceptually closer to our approach, the focus of the two studies is rather different. In their work, Boschi et al. (2020) concentrate the analysis on the role of mobility and positivity as predictors of COVID-19 mortality, and other covariates are only considered -one at a time -as control variables in the regression model. In our work, we take a more comprehensive view and assess the competing effect of several factors on mortality during the pandemic. Nonetheless, we acknowledge the contributions of these two studies, and our work is meant to complement and provide additional insights on the dynamic of the Italian epidemic.
Our findings are generally in line with those reported by recent research on the COVID-19 pandemic. The share of ICU utilization has the strongest association with mortality, after controlling for the (cluster-specific) epidemic time trend and additional factors. Regions where ICU needs exceeded the available stock experienced the highest levels of COVID-19 mortality. The saturation of ICU beds and ventilator availability has indeed been an important factor to explain the higher number of COVID-related deaths in Lombardia and Italy (Favero, 2020;Volpato et al., 2020). Furthermore, the negative association between mortality and (i) the delay of the outbreak and (ii) the cumulative number of tests performed, are reasonable: regions that experienced later outbreaks had relatively more time to prepare for it, and higher number of tests could have allowed for a better tracing of infected individuals -and therefore a lower number of new cases and deaths. For example, an empirical study in the municipality of Vo' (Lavezzo et al., 2020) has shown that viral transmission of the disease can be suppressed when these factors interact (early isolation of infected people combined with community lockdown).
Finally, the positive, albeit comparatively weaker, associations of mortality with the remaining covariates (GDP per capita, share of older persons and number of positive COVID-19 cases) are also in line with previous studies. The COVID-19 mortality gradient in Italian regions closely matches the economic gradient, with Northern regions having higher GDP per capita than Southern ones. Furthermore, the role of population age structures has been suggested to explain the higher number of COVID-deaths in older versus younger populations (Dowd et al., 2020).
Interestingly, some of our findings are not aligned with recent research on the pandemic. Rocklv and Sjdin (2020) suggested that population densities are an important catalyst for the spread of COVID-19. Moreover, nursing homes have been identified as hotspots of COVID-related deaths in Italy (Trabucchi and Leo, 2020;di Giacomo et al., 2020). In our analysis, these two variables were not significantly associated with COVID-19 mortality, after controlling for other factors. Similarly, the dimension of households has been proposed as a key factor (together with the population age structure) to determine the vulnerability of countries to outbreaks of COVID-19 (Esteve et al., 2020). Conversely, the dimension of households is negatively correlated with COVID-19 mortality in Italian regions during the period analysed, and a similar negative correlation is observable across clusters for the number of nursing homes (see Figure B.2 in Appendix B). It is likely that these discrepancies are related to the contexts of the analysis, as our subnational setting differs from those considered in these studies. As such, our findings highlight the importance of context-specific analysis, providing a warning to the generalizability of COVID-related hypothesis and results.
In conclusion, our study shed light on the most significant factors that have contributed to the spread of COVID-19 in Italian regions. In addition to their scientific value, our findings provide important insights for policymakers in the event of future pandemics or a potential second wave of COVID-19. Finally, the methodology that we propose in this article is a novel contribution to the analysis of mortality during epidemics, which can be fully replicated and applied to other countries and frameworks (even outside epidemic research) using the codes provided along with our article.

A The regression model: statistical details
In this Appendix, we provide additional information about the estimation procedure we performed for the model introduced in Section 2.1. Beside the explanatory variables collected in the regional-specific matrices Z t,r , the proposed model requires two datasets as input data: the matrix of deaths D = (d t,r ) over time t for region r, and the vectoȓ e = (ȇ r ) that collects the population exposures to the risk of death for each region r. For the purpose of regression, we arrange death counts as column vectors, i.e. d = vec(D). Likewise we arrange the matrix of exposures e = vec(E), where E = (e t,r ) = 1 mȇ , with 1 m a m × 1 matrix of ones.
As mentioned in Section 2.1, we assume deaths to be realizations from a Poisson distribution with meanȇ r µ t,r = e t,r µ t,r . We then model our Poisson death counts for all Italian regions via a log-link function: ln [E (d)] = ln (e) + ln(µ) = ln (e) + η = ln (e) + X θ , where η = ln (µ) is the linear predictor. The design matrix X contains all features described in (1) and the coefficients vector θ is a vector of the model's coefficients which needs to be estimated.
In order to present all model components and without loss of generality, we sort n = 9 regions according to their cluster membership and we consider these regions equally distributed in s = 3 clusters. Design matrix and coefficients vector are then given by where B are k equally spaced B-spline bases and 0 are m × k matrices of zeros. The last columns in X are made of all region-specific m × p matrices with the values of the explanatory variables that could eventually change over t.
The vector θ concatenates all associated coefficients and consequently common timetrend and clustering deviation from it can be written as follows: Following a P -splines approach, we use a generous number of B-spline and enforce smoothness of η 0 , γ 2 and γ 3 by penalizing the associated coefficients. As result, we can adapt the iteratively re-weighted least-squares algorithm for estimating GLMs by adding a penalty term: where z = (d − e * µ)/e * µ + η is the working dependent variable and W is a diagonal matrix of weights, W = diag(e * µ). The penalty term P must work only on α 0 , α 2 and α 3 and it takes the following diagonal structure: where ∆ is a difference matrix over t. The smoothing parameters λ 0 and λ γ control the trade-off between smoothness of (α 0 , α 2 , α 3 ) and model fidelity. We assume second order differences for ∆ and the same smoothing parameter for all cluster-specific deviations. Finally the Bayesian Information Criterion was employed to optimize both λ 0 and λ γ .

B Exploratory analysis
In this Appendix, we perform some exploratory analysis of the dataset that we employ in our study. Figure B.1 shows the distributions, correlations and scatter plots of COVID-mortality (in log-scale) and the time-varying covariates described in Subsection 2.2, stratified by the three clusters of High, Medium and Low mortality. The Figure provides several information. The scatter plots in the last row show the relationship between mortality and the variables, and their correlations can be read from the last column of the graph. The non-linear relationship between time and mortality clearly emerges, as well as the strong linear relationship between mortality and ICU utilization (after a square root transformation). The negative correlation between cumulative tests and mortality is also evident, and the mortality density plot (last panel in the last row) show the ordering of the clusters in terms of different mortality levels (i.e. higher clusters are characterized by higher mortality).
A similar plot for the time-constant variables is provided in Figure B.2. On one hand, the six constant variables have a rather low correlation to mortality; on the other, some variables are highly correlated between each other.

C Additional results
Here we present additional results of our analysis. Figure  three patterns of the pandemic, with mortality level and shapes well stratified over time.
The three clusters are clearly ordered into High, Medium and Low mortality. Figures C.2 show the observed and fitted log-mortality rates of COVID-positive deceased individuals with 95% confidence intervals in each region as well as for the overall country (top-left panel).
Finally, Figure