Variable-weighted ultrametric optimization for mixed-type data: continuous, ordinal, nominal, binary symmetric and binary asymmetric

Author: 
Peer reviewed: 
No, item is not peer reviewed.
Date created: 
2009
Keywords: 
Data mining
Data mining -- Mathematical models
Knowledge acquisition (Expert systems)
Cluster analysis
Hypothesis generation
Ultrametric optimization
Abstract: 

Scientific research begins with hypothesis generation, for which cluster analysis (CA) can be used. Traditionally, CA involves continuous variables weighted equally, and the subjective choice of linkage and stopping rules. Variable weighting for cluster analysis (VWCA), beginning with De Soete (1985/6), produces weights that may be useful for hypothesis generation. De Soete’s VWCA optimized ultrametricity, a property of better separated clusters, without requiring CA. We developed variable-weighted ultrametric optimization for mixed-type data (VWUO-MD), starting with a variable-weighted, multivariate distance for data with any number of continuous, ordinal, nominal, binary symmetric and binary asymmetric (e.g., rare disease) variables. In Monte Carlo simulations we found that the weights are consistent with a priori relationships between variables under several distributions. For some relationships (e.g., single group linear), the method performs poorly. Compared to De Soete's method, VWUO-MD better penalizes 0-weights and better ensures a unique solution through a strategic random restart procedure. The bootstrap covariance matrix is slightly conservative. For mixtures of at least four continuous/nominal variables, a U-statistic-based covariance matrix performs well. Point estimates and covariances are invariant to column/category/record order and to affine transformations. We analyzed a subset of the Joint Canada/United States Survey of Health: working, mature students 50+ years old who received health services in the past year (n=167), split into training and testing segments. Prescreening within types and backwards elimination with VWUO-MD reduced the variable space. The final 14 variable weights were plotted as a scree plot. On the testing segment, a model was fit from the upper scree plot variables. Similar models were fit from the lower scree plot variables and from the variables rejected by prescreening and by backwards elimination. Models were ordered on overall statistical significance, and the upper model had the best fit, indicating that VWUO-MD had successfully mined these data for hypotheses. We learned that reduction in activities due to a long-term health condition was associated with consultations with a mental health professional in the past year (odds ratio=12.25, 95% CI=1.67, 90.02). While needing additional research, in its present form VWUO-MD produces variable weights that may be informative for hypothesis generation on data with varied mixtures of data types.
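To make the two ingredients named in the abstract more concrete, the following is a minimal Python sketch of a variable-weighted distance for mixed-type records and of a simple measure of departure from ultrametricity. It is not the thesis's VWUO-MD formulation, which this record does not give. The distance is a Gower-style stand-in, the violation measure is a simplified sum of squared gaps between the two largest distances in each triple, and all function names, type labels, and the example data are hypothetical. Ordinal variables are assumed to be coded as numeric ranks.

```python
import itertools

import numpy as np


def weighted_mixed_distance(x, y, kinds, weights, ranges):
    """Gower-style weighted distance between two records (illustrative stand-in).

    kinds[i] is one of: "continuous", "ordinal", "nominal",
    "binary_symmetric", "binary_asymmetric" (hypothetical labels).
    ranges[i] is the observed range for numeric variables, ignored otherwise.
    """
    num = 0.0
    den = 0.0
    for xi, yi, kind, w, r in zip(x, y, kinds, weights, ranges):
        if kind in ("continuous", "ordinal"):
            d = abs(xi - yi) / r  # range-scaled absolute difference
        elif kind in ("nominal", "binary_symmetric"):
            d = 0.0 if xi == yi else 1.0  # simple mismatch
        elif kind == "binary_asymmetric":
            if xi == 0 and yi == 0:
                continue  # joint absence (e.g., no rare disease) is skipped
            d = 0.0 if xi == yi else 1.0
        else:
            raise ValueError(f"unknown variable kind: {kind}")
        num += w * d
        den += w
    return num / den if den > 0 else 0.0


def distance_matrix(records, kinds, weights, ranges):
    """Symmetric matrix of pairwise weighted distances."""
    n = len(records)
    D = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        D[i, j] = D[j, i] = weighted_mixed_distance(
            records[i], records[j], kinds, weights, ranges
        )
    return D


def ultrametric_violation(D):
    """Sum over triples of the squared gap between the two largest distances.

    The value is 0 exactly when every triple satisfies the ultrametric
    inequality, so smaller values mean a "more ultrametric" matrix.
    """
    total = 0.0
    for i, j, k in itertools.combinations(range(D.shape[0]), 3):
        a, b, c = sorted((D[i, j], D[i, k], D[j, k]))
        total += (c - b) ** 2
    return total


# Hypothetical example: one continuous, one nominal, one asymmetric binary variable.
records = [(1.0, "a", 1), (2.0, "a", 0), (5.0, "b", 0)]
kinds = ("continuous", "nominal", "binary_asymmetric")
weights = (0.5, 0.3, 0.2)
ranges = (4.0, None, None)

D = distance_matrix(records, kinds, weights, ranges)
violation = ultrametric_violation(D)
```

In a weighting scheme of this general kind, one would search over non-negative weights to make the violation small, for example with a constrained optimizer and several random starting points. The details of the penalty on 0-weights and of the restart procedure described in the abstract are not reproduced here.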

Language: 
English
Document type: 
Thesis
Rights: 
Copyright remains with the author. The author granted permission for the file to be printed, but not for the text to be copied and pasted.
File(s): 
Senior supervisor: 
R
Department: 
Dept. of Statistics and Actuarial Science - Simon Fraser University
Thesis type: 
Thesis (Ph.D.)