A new Dual-Dagum-G (DDa-G) family is defined as a good competitor to the Beta-G and Kumaraswamy-G generators, which are widely applied in several areas. Some of its mathematical properties are addressed. We obtain the maximum likelihood estimates, and some simulations prove the consistency of the estimates. The flexibility of this family is shown through a COVID-19 data set. We propose a new regression based on a special distribution of the DDa-G family, and provide a sensitivity analysis by using data from 1,951 COVID-19 patients collected in Curitiba, Brazil.
Over the last few decades, many generators have been studied in the distribution theory literature. Two generators that stand out are the Beta-G (B-G) (Eugene et al., 2002) and Kumaraswamy-G (Kw-G) (Cordeiro & de Castro, 2011) classes.
Regarding the B-G family, we can say that, although it contains the incomplete and complete beta functions, its flexibility in terms of adjustment to real data is widespread. Several authors introduced new distributions in this family in different contexts: cancer recurrence (Paranaíba et al., 2011), waiting times before service of 100 bank customers (Abd El-Bar & Ragab, 2015), test on the endurance of deep groove ball bearings (Abu-Zinadah & Bakoban, 2017), survival times of 33 patients suffering from acute Myelogeneous Leukaemia (Mead et al., 2017), among others. More than one-hundred different published distributions in this class can be found to date.
The second family stands out because of the simplicity of its density function, which does not include complicated functions. Further, its suitability for the most diverse types of data sets is widely discussed in the literature. We can cite, for example, the work that originated this family and used data from adult numbers of T. confusum cultured at 29C (Eugene et al., 2002). Since then, many sub-models in this family have been proposed for various types of data sets: fatigue (Cordeiro et al., 2010), body mass index (Mameli, 2015), component lifetime (Cordeiro et al., 2016), among others.
We know through an analysis of these works that the fits of both classes to real data have a better performance compared to other known classes. We can note that the data sets studied in the aforementioned works are of different types. However, many authors end up repeating the same data sets used in previous works by other authors.
In this sense, we define a new class from the Dagum distribution (Dagum, 1975) and use data bases never published before. The data bases in question concern a very current topic: COVID-19. We understand the importance of studies on this pandemic that impacted the world, and then use COVID-19 data from two cities in Brazil.
The remainder of the paper is organized as follows. Section 2 defines the new family. In Section 3, we present some of its generated distributions. The main properties of the new family are reported in Section 4. Estimation including the case of censoring is addressed in Section 5. A simulation study is done in Section 6. In Section 7, we construct the Log-Dual-Dagum-Weibull regression, and estimate the parameters. Two applications to real data are reported in Section 8, including a regression application and a sensitivity analysis. Conclusions end the paper in Section 9.
2.The new family
The new generator is defined based on the survival function of the Dagum distribution (Dagum, 1977). Kleiber and Kotz (2003) and Kleiber (2008) analyzed characteristics and properties of this distribution. The Dagum distribution presents forms of the increasing, decreasing, bathtub and inverted bathtub risk function (Domma, 2002). This behavior has aroused the interest of several authors to study it in survival analysis (Domma et al., 2011a, b). In this sense, we propose the Dual-Dagum-G (DDa-G) family.
Let be a Dagum random variable with two positive shape parameters and . The cumulative distribution function (cdf) of the DDa-G family (for ) is
where is the baseline cdf, and is its parameter vector.
Henceforth, Eq. (1) refers to the random variable .
The probability density function (pdf) of has the form
The hazard rate function (hrf) of is
3.Special DDa-G distributions
The DDa-W density (for ) is defined from Eq. (2) and the Weibull cdf with scale and shape , namely
where all parameters are positive. For , we obtain the Dual-Dagum-Exponential distribution.
The cdf of the log-logistic (LL) distribution is (for )
Inserting this expression and its derivative in Eq. (2) leads to the DDa-LL density (for )
Figures 1 and 2 display shapes of the pdf and hrf of the previous generated models, which show their flexibility in fitting data with different shapes. For example, the Weibull pdf presents only decreasing and unimodal shapes, whereas the DDaW pdf has an extra shape: decreasing-increasing-decreasing.
For any real , the power series holds (Ward, 1934)
and , etc., are the Stirling polynomials.
By expanding the binomial term,
where (for ), and then we obtain
where and for .
Next, we use a theorem of Henrici (1993) for a power series raised to any real power different from zero
where the coefficients are determined recursively from and (for )
Formulas for other functions may be found in Hairer et al. (1993).
A random variable following the exponentiated-G (exp-G) distribution and power , say , has cdf , and pdf . Many exponentiated distributions were given in Table 1 of Tahir and Nadarajah (2015).
By differentiating Eq. (5) and using the concept of exp-G distribution, we can write
Equation (6) is the linear representation for the DDa-G family density in terms of exp-G densities. So, it can provide some mathematical properties for sub-models of the new family from exp-G properties.
Let be the quantile function (qf) of G (for ). Inverting in Eq. (1), the qf of becomes
Equation (7) reveals that the qf of the proposed family is a function of the baseline qf.
The skewness and kurtosis of can be calculated from the first four ordinary moments (see Section 4.3). Alternatively, approximations for the skewness () and kurtosis () can be obtained from Eq. (7) and the formulae
reported by Kenney and Keeping (1961) and Moors (1988), respectively.
From now on, let . The th ordinary moment of comes from Eq. (6)
Moments for several exp-G distributions reported by Nadarajah and Kotz (2006) give .
The generating function (gf) of follows from Eq. (6) as
where is the gf of .
The estimation of the unknown parameters of the DDa-G distribution is performed by the maximum likelihood method. Let be a sample from Eq. (2). Let be the parameter vector in . The log-likelihood function has the form
The R software has the AdequacyModel computational library (Marinho et al., 2019) as a good alternative for maximizing , obtain the maximum likelihood estimates (MLEs), their standard errors, and some statistical measures to evaluate the adequacy of a fitted distribution.
We adopt the exponential (E) baseline (with the expected value ) for the simulations to assess the accuracy of the MLEs in the ) model. In order to generate observations from the DDa-E distribution, we adopt the inversion method following the steps:
1. Generate ;
2. Return , where
We consider 2,000 Monte Carlo replications and the BFGS algorithm in the R software for maximizing the log-likelihood, obtain the MLEs and their averages, biases and mean square errors (MSEs). The simulation process is carried out as below:
1. Simulate DDa-E observations for fixed by means of the previous scheme.
2. Three scenarios considered are: , and (Setup 1); , and (Setup 2); and , and (Setup 3). The values of the true parameters are chosen arbitrary.
3. We calculate the MLEs from each generated data set, and obtain the averages, biases and MSEs.
Table 1 reports these findings. The average estimates converge to the true parameter values and the biases decrease when increases (for all scenarios).
7.Log-Dual-Dagum-Weibull (LDDa-W) regression
If follows density Eq. (3), then has the LDDa-W distribution, which reparameterized in terms of and , has the form (for )
where , , and .
The survival function corresponding to Eq. (8) is
The pdf of the standardized random variable (for ) is
The lifetimes are affected by known explanatory variables in many applications. Consider a sample of independent observations, where , and the log-lifetime and the log-censoring are assumed independent (under non-informative censoring).
The LDDa-W regression for the response variable is defined by
where is the random error with density Eq. (9), , is a scale, and and are shape parameters. The location vector , where , is , where is a known model matrix. The log-DDa-exponential (LDDa-E) regression is given by Eq. (10) with .
where is the observed number of failures. The MLE of can be found by maximizing Eq. (11).
The Weibull and Birnbaum-Saunders distributions are taken as baselines to prove the flexibility of the new family. The data sets were obtained from the open data portal of the Federal Government linked to the Ministry of Health and comprise events from 2020–2021 (accessed on August 23, 2021). The data portal is available at https://dados.gov.br/dataset/bd-srag-2020.
All computations are done in R using
AdequacyModellibraries, and the
goodness.fit()function with the “
SANN” method. The initial parameter values to maximize the log-likelihood were obtained through a heuristic method by using the
MASSpackage in the R language.
The new distributions are compared with well-known models belonging to the Kw-G and B-G classes using the statistics: Cramér-von Mises (), Anderson-Darling (), Kolmogorov- Smirnov (), and -values of the test. We present below the alternative densities for the applications.
• The Kumaraswamy-Weibull (Kw-W) density (Cordeiro et al., 2010) (for )
• The Beta Weibull (B-W) density (Lee et al., 2007), and explored by Cordeiro et al. (2013) (for )
where and is the beta function.
• The Beta-Birnbaum-Saunders (B-BS) density (Cordeiro & Lemonte, 2011) (for and )
• The Kumaraswamy-Birnbaum-Saunders (Kw-BS) density (Saulo et al., 2012) (for and )
In the following, we calculate descriptive statistics, MLEs, their standard errors (SEs) and adequacy statistics to compare the fitted distributions to the data sets.
8.1COVID-19 data in Recife
The first application represents the times (in days) of 564 COVID-19 patients from the date of entry in the Intensive Care Unit (ICU) until cure in Recife (State of Pernambuco). In this context, the cure characterizes the evolution of the case as hospital discharge. Discharge from hospital can only mean that the patient no longer needs hospitalization.
The descriptive statistics for the time until cure for COVID-19 data in Recife include: mean 20.85, SD 20.58, skewness 2.94, kurtosis 17.53, and minimum and maximum values 1 and 206, respectively. For these data, the total time under test (TTT) plot (Aaset, 1987) indicated an inverted bathtub form for the hrf (not shown here).
The values of the statistics , , , and the -values of the test, are reported in Table 2. The DDa-W model is better than the Kw-G and B-G classes for both Birnbaum-Saunders and Weibull baselines as shown in Table 2. So, the proposed family can provide better fits than the well-known B-G and Kw-G classes.
|Distribution||MLEs and SEs||-value|
The Vuong test (Vuong, 1989) also reveals that the DDa-W distribution is better than the DDa-BS ( 2.47), Kw-W ( 16.39) and B-W ( 14.76) distributions for a level of significance of 5%.
Figure 5 displays the histogram of the data, where represents the time and the fitted DDa-W density and some other densities. The plots confirm that the DDa-W distribution provides the best fit to the current data. Further, the Q-Q plots of the best four fitted distributions (not shown here) indicate that the DDa-W distribution provides a superior fit to the current data.
8.2Regression modeling applied to COVID-19 data in Curitiba (Brazil)
The study comprises the time (in days) elapsed from the date of hospitalization until death by the coronavirus, of 1,951 patients in Curitiba-PR, with all observations failing, that is, censored times were not considered in the study, with occurrences of death in 2020 and 2021.
The explanatory variables are (for ):
• : sex (1 masculine, 2 feminine);
• : age (in years).
The computational part is developed in R using
optim()function, and the “
We adopt the Akaike information criterion (AIC), corrected Akaike information criterion (CAIC), and Bayesian information criterion (BIC) to choose the appropriate model. We compare the fits of the LDDa-W Eq. (8) with the log-Kumaraswamy-Weibull or Kumaraswamy Gumbel (Kw-Gu) (Cordeiro et al., 2012), log-beta Weibull (LBW) and log-Weibull (LW) models. The densities for the alternative regressions are reported below:
-value 0.0001; -value 0.1.
• The LW (or Gumbel) density function
where , and ; see Johnson et al. (1995).
• The LBW density function
where , and . For more details, see Ortega et al. (2013).
• The Kw-Gu density function
where , and .
The failure rate function is useful to aid in model identification more suitable for the variable time. In this context, the TTT plot (not shown here) for the data under study shows an increasing appearance for the most part, but due to its final behavior, it indicates an inverting bathtub risk function. The descriptive statistics for the time until death for COVID-19 data in Curitiba include: mean 16.86, SD 15.54, skewness 3.75, kurtosis 29.77, and minimum and maximum values of 1 and 205, respectively.
Next, we provide results from the fit of the regression
where are independent random variables with density function Eq. (9).
Table 3 provides some findings from the regressions fitted to the current data. They indicate that LDDa-W model provides the best fit to the data. Further, all covariates (: sex and : age) are significant at 1%, and then the categories of these explanatory variables are statistically different.
Thus, the time to death decreases when the age increases. Regarding the patient’s gender, male patients present smaller time until death than female patients, since the estimate of its coefficient is negative.
After the LDDa-W regression estimation, the plots of the empirical and estimated survival functions support the model adequacy to these data.
Also, as part of the analysis, it is important to verify if there are observations influencing the model’s adjustment. A sensitivity analysis was carried out to investigate this fact using the Cook’s distance and will be presented below.
8.2.1Sensitivity analysis and influential observations
Under the Generalized Cook Distance (Cook, 1977), the observations #349, #826 and #897 are the ones that stand out the most, thus indicating that they can be possible influential observations.
The observation #349 refers to a female individual, aged 84 years and with a time of hospitalization until death of 172 days. The observation #826 is identified as a 78-year-old male and has a time to death of 6 days. And the observation #897 refers to the male individual aged 61 years old, whose hospitalization time until death was only 3 days. The observations #349, #826 and #897 represent individuals with peculiar behaviors, but do not show signs of error in data collection or transcription, and therefore must be kept in the database. The final model is given in Eq. (12).
The impact of possible influential observations detected should be analyzed in order to assess the estimates and sensitivity of the model. This analysis considers new estimates for the model parameters from sub-samples referring to the withdrawal of these observations individually and in groups.
It is considered that the changes in the estimated values for the parameters are not very expressive and there is significance of the explanatory variables when considering the level of 10%. In addition, there was no change in the sign of the coefficient of the explanatory variables, so the inclusion or exclusion of the identified observations does not presuppose changes in the interpretation of the results.
One of the main objectives of distribution theory is to define a family of models to better explain lifetime phenomenon in several areas of knowledge. We proposed the Dual-Dagum-G (“DDa-G”) family, which can generalize all classical continuous distributions. Its parameters are estimated by maximum likelihood, and a simulation study showed the consistency of the estimators. We showed the flexibility of the new family by means of two real COVID-19 data sets. We proved that the new Log-Dual-Dagum-Weibull (LDDaW) regression outperformed regressions based on well-known Kumaraswamy-G and Beta-G generators. After verifying the good fit of the new regression, a sensitivity analysis was performed, where it was possible to verify the occurrence of influential observations. As future work, it could be interesting to investigate other methods of sensitivity analysis, such as the local influence, if the results obtained through the Cook Distance prevail and still carry out a residual analysis for the new regression model.
Note: Computer codes in
Rlanguage can be provided to readers free upon request.
Aarset, M.V. ((1987) ). How to identify a bathtub hazard rate. IEEE Transactions on Reliability, 36: , 106-108.
Abd El-Bar, A.M.T., & Ragab, I.E. ((2015) ). The beta generalized inverted exponential distribution. International Information Institute, 18: , 421-430.
Abu-Zinadah, H.H., & Bakoban, R.A. ((2017) ). The Beta Generalized Inverted Exponential Distribution with real data applications. Revstat – Statistical Journal, 15: , 65-88.
Cook, R.D. ((1977) ). Detection of influential observation in linear regression. Technometrics, 19: , 15-18.
Cordeiro, G.M., & de Castro, M. ((2011) ). A new family of generalized distributions. Journal of Statistical Computation and Simulation, 81: , 883-898.
Cordeiro, G.M., & Lemonte, A.J. ((2011) ). The β-Birnbaum-Saunders distribution: An improved distribution for fatigue life modeling. Computational Statistics & Data Analysis, 55: , 1445-1461.
Cordeiro, G.M., Nadarajah, S., & Ortega, E.M.M. ((2012) ). The kumaraswamy gumbel distribution. Statistical Methods & Applications, 21: , 139-168.
Cordeiro, G.M., Nadarajah, S., & Ortega, E.M.M. ((2013) ). General results for the beta Weibull distribution. Journal of Statistical Computation and Simulation, 83: , 1082-1114.
Cordeiro, G.M., Ortega, E.M.M., & Nadarajah, S. ((2010) ). The Kumaraswamy Weibull distribution with application to failure data. Journal of the Franklin Institute, 347: , 1399-1429.
Cordeiro, G.M., Saboor, A., Khan, M.N., Ozel, G., & Pascoa, M.A.R. ((2016) ). The kumaraswamy exponential-weibull distribution: Theory and applications. Hacettepe Journal of Mathematics and Statistics, 45: , 1203-1229.
Dagum, C. ((1975) ). A model of income distribution and the conditions of existence of moments of finite order. Bulletin of the International Statistical Institute, 46: , 199-205.
Dagum, C. ((1977) ). A new model of personal income-distribution-specification and estimation. Economie Appliquáe, 30: , 413-437.
Domma, F. ((2002) ). L’andamento della hazard function nel modello di Dagum a tre parametri. Quaderni di Statistica, 4: , 1-12.
Domma, F., Giordano, S., & Zenga, M. ((2011) a). Maximum likelihood estimation in Dagum distribution with censored samples. Journal of Applied Statistics, 38: , 2971-2985.
Domma, F., Latorre, G., & Zenga, M. ((2011) b). Reliablity studies of Dagum distribution. Working Paper, 206: , 1-17.
Eugene, N., Lee, C., & Famoye, F. (2002) Beta-normal distribution and its applications. Communications in Statistics – Theory and Methods, 31: , 497-512.
Hairer, E., Norsett, S., & Wanner, G. ((1993) ). Solving Ordinary Differential Equations I. Springer, 8: , 528.
Henrici, P. ((1993) ). Applied and Computational Complex Analysis, Volume 3: Discrete Fourier analysis-Cauchy integrals-Construction of Conformal Maps-Univalent Functions. John Wiley & Sons Inc., 3: , 656.
Johnson, N.L., Kotz, S., & Balakrishnan, N. ((1995) ). Continuous univariate distributions. John Wiley & Sons, 2: , 732.
Kenney, J.F., & Keeping, E.S. ((1961) ). Mathematics of statistics. D. Van Nostrand Company, 1: , 429.
Kleiber, C. ((2008) ). A guide to the Dagum distributions. Modeling Income Distributions and Lorenz Curves – Springer, 5: , 97-117.
Kleiber, C., & Kotz, S. ((2003) ). Statistical size distributions in economics and actuarial sciences. John Wiley & Sons, 470: , 352.
Lee, C., Famoye, F., & Olumolade, O. ((2007) ). Beta-weibull distribution: Some properties and applications to censored data. Journal of Modern Applied Statistical Methods, 6: , 173-186.
Mameli, V. ((2015) ). The Kumaraswamy skew-normal distribution. Statistics & Probability Letters, 104: , 75-81.
Marinho, P.R.D., Silva, R.B., Bourguignon, M., Cordeiro, G.M., & Nadarajah, S. ((2019) ). AdequacyModel: An R package for probability distributions and general purpose optimization. PLoS ONE, 14: , e0221487.
Mead, M., Afify, A.Z., Hamedani, G., & Ghosh, I. ((2017) ). The beta exponential fréchet distribution with applications. Austrian Journal of Statistics, 46: , 41-63.
Moors, J. ((1988) ). A quantile alternative for kurtosis. Journal of the Royal Statistical Society. Series D (The Statistician), 37: , 25-32.
Nadarajah, S., & Kotz, S. ((2006) ). The exponentiated type distributions. Acta Applicandae Mathematica, 92: , 97-111.
Ortega, E.M., Cordeiro, G.M., & Kattan, M.W. ((2013) ). The log-beta weibull regression model with application to predict recurrence of prostate cancer. Statistical Papers, 54: , 113-132.
Paranaíba, P.F., Ortega, E.M.M., Cordeiro, G.M., & Pescim, R.R. ((2011) ). The Beta Burr XII distribution with application to lifetime data. Computational Statistics & Data Analysis, 55: , 1118-1136.
Saulo, H., Leão, J., & Bourguignon, M. ((2012) ). The kumaraswamy birnbaum-saunders distribution. Journal of Statistical Theory and Practice, 6: , 745-759.
Tahir, M.H., & Nadarajah, S. ((2015) ). Parameter induction in continuous univariate distributions: Well-established G families. Anais da Academia Brasileira de Ciências, 87: , 539-568.
Vuong, Q.H. ((1989) ). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: Journal of the Econometric Society, 57: , 307-333.
Ward, M. ((1934) ). The Representation of Stirling’s Numbers and Stirling’s Polynomials as Sums of Factorials. American Journal of Mathematics, 56: , 87-95.