An overview of techniques for dealing with large numbers of independent variables in epidemiologic studies.
Abstract
Many studies of health and production problems in livestock involve the simultaneous evaluation of large numbers of risk factors. These analyses may be complicated by several problems: multicollinearity (which arises because many of the risk factors are correlated with one another), confounding, interaction, limited sample size (and hence limited study power), and the evaluation of many associations from a single dataset. This paper focuses primarily on multicollinearity and discusses a number of techniques for dealing with it, although some of these techniques may also help with the other problems listed above. The first general approach is to reduce the number of independent variables before investigating associations with the disease. Techniques to accomplish this include: (1) excluding variables after screening for associations among independent variables; (2) creating indices or scores that combine data from multiple factors into a single variable; and (3) creating a smaller set of independent variables using multivariable techniques such as principal components analysis or factor analysis. The second general approach is to use appropriate steps and statistical techniques to investigate associations between the independent variables and the dependent variable. These associations may first be screened using simple statistical tests; multivariable techniques such as linear regression, logistic regression, or correspondence analysis can then be used to identify important associations. The strengths and limitations of these techniques are discussed, and the techniques are demonstrated using a dataset from a recent study of risk factors for pneumonia in swine. Emphasis is placed on comparing correspondence analysis with the other techniques because it has been used less often in the epidemiologic literature.
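As an illustration of the first technique listed above (screening for associations among independent variables), the following is a minimal sketch in Python rather than the procedure used in the paper. The variable names, the data values, and the 0.8 cutoff are all hypothetical and chosen only for illustration. Pairs of variables whose Pearson correlation is at or above the cutoff are flagged as candidates for exclusion or for combination into an index.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(data, threshold=0.8):
    """Return (name_a, name_b, r) for variable pairs with |r| >= threshold.

    The threshold is an assumed, illustrative cutoff, not a recommended value.
    """
    flagged = []
    for a, b in combinations(sorted(data), 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical herd-level risk factors (made-up values, not from the study).
herds = {
    "herd_size": [120, 300, 450, 800, 1000, 1500],
    "pigs_per_pen": [10, 14, 18, 25, 30, 40],
    "ventilation_score": [3, 1, 4, 2, 5, 2],
}

flagged = correlated_pairs(herds, threshold=0.8)
print(flagged)
```

In this made-up dataset only the herd size and pigs-per-pen pair is flagged, so an analyst might retain one of the two or combine them before modelling. Pearson correlation suits continuous variables; categorical risk factors would need a different measure of association.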