Synthetic data method to incorporate external information into a current study. Issue 4 (26th June 2019)
- Record Type:
- Journal Article
- Title:
- Synthetic data method to incorporate external information into a current study. Issue 4 (26th June 2019)
- Main Title:
- Synthetic data method to incorporate external information into a current study
- Authors:
- Gu, Tian
Taylor, Jeremy M. G.
Cheng, Wenting
Mukherjee, Bhramar
- Abstract:
- We consider the situation where there is a known regression model that can be used to predict an outcome, Y, from a set of predictor variables X. A new variable B is expected to enhance the prediction of Y. A dataset of size n containing Y, X and B is available, and the challenge is to build an improved model for Y | X, B that uses both the available individual-level data and some summary information obtained from the known model for Y | X. We propose a synthetic data approach, which consists of creating m additional synthetic data observations, and then analyzing the combined dataset of size n + m to estimate the parameters of the Y | X, B model. This combined dataset of size n + m now has missing values of B for m of the observations, and is analyzed using methods that can handle missing data (e.g., multiple imputation). We present simulation studies and illustrate the method using data from the Prostate Cancer Prevention Trial. Though the synthetic data method is applicable to a general regression context, to provide some justification, we show in two special cases that the asymptotic variances of the parameter estimates in the Y | X, B model are identical to those from an alternative constrained maximum likelihood estimation approach. This correspondence in special cases and the method's broad applicability make it appealing for use across diverse scenarios. The Canadian Journal of Statistics 47: 580–603; 2019 © 2019 Statistical Society of Canada
Résumé (translated from the French): The authors consider the situation where a known regression model can be used to predict a response Y from predictors X. A new variable B should improve the predictions of Y. A dataset of size n containing the variables Y, X and B is available, and the challenge is to build an improved model for Y | X, B that draws on the individual-level data as well as summary information from the known Y | X model. The authors propose generating m new synthetic observations, then analyzing the combined dataset of n + m observations to estimate the parameters of the Y | X, B model. The combined dataset has n + m observations, m of which have missing values for B, so its analysis calls for appropriate methods (such as imputation). The authors present simulation studies and illustrate their method with real data from the Prostate Cancer Prevention Trial. Although the synthetic data approach applies in a general regression context, the authors describe two special cases in which the asymptotic variance of the parameter estimators in the Y | X, B model is identical to that of a constrained maximum likelihood approach. This correspondence in special cases and the broad applicability of the method make it an attractive choice for various scenarios. La revue canadienne de statistique 47: 580–603; 2019 © 2019 Société statistique du Canada
- Is Part Of:
- Canadian journal of statistics. Volume 47:Issue 4(2019)
- Journal:
- Canadian journal of statistics
- Issue:
- Volume 47:Issue 4(2019)
- Issue Display:
- Volume 47, Issue 4 (2019)
- Year:
- 2019
- Volume:
- 47
- Issue:
- 4
- Issue Sort Value:
- 2019-0047-0004-0000
- Page Start:
- 580
- Page End:
- 603
- Publication Date:
- 2019-06-26
- Subjects:
- Constrained maximum likelihood -- data integration -- prediction models -- synthetic data
Mathematical statistics -- Periodicals
519.5
- Journal URLs:
- http://archimede.mat.ulaval.ca/cjs/
http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1708-945X/issues
http://www.jstor.org/journals/03195724.html
http://onlinelibrary.wiley.com/
http://www.ingentaconnect.com/content/ssc/cjs
http://www.mat.ulaval.ca/rcs/indexe.shtml
- DOI:
- 10.1002/cjs.11513
- Languages:
- English
- ISSNs:
- 0319-5724
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3035.760000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 16244.xml
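
The synthetic data approach summarised in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single continuous predictor X, a known linear Y | X model with Gaussian errors, and a simplified imputation step that treats the fitted B | X, Y model as fixed rather than drawing its parameters as a full multiple-imputation procedure would. The function name `synthetic_data_fit` and all of its parameters are hypothetical.

```python
import numpy as np


def ols(design, y):
    """Least-squares coefficients for y ~ design (design includes the intercept column)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def synthetic_data_fit(y, x, b, known_coef, known_sd, m, n_imputations=20, seed=0):
    """Estimate (intercept, X slope, B slope) of Y | X, B from n real rows plus m synthetic rows.

    known_coef = (intercept, slope) and known_sd describe the assumed known Y | X model.
    """
    rng = np.random.default_rng(seed)
    n = len(y)

    # Step 1: create m synthetic rows. X is resampled from the observed X (an assumption),
    # and Y is drawn from the known Y | X model. B is missing for these rows.
    x_syn = rng.choice(x, size=m, replace=True)
    y_syn = known_coef[0] + known_coef[1] * x_syn + rng.normal(0.0, known_sd, size=m)

    # Step 2: fit an imputation model B | X, Y on the n complete rows.
    d_obs = np.column_stack([np.ones(n), x, y])
    imp_coef = ols(d_obs, b)
    imp_sd = np.std(b - d_obs @ imp_coef, ddof=3)
    d_syn = np.column_stack([np.ones(m), x_syn, y_syn])

    # Step 3: impute B for the synthetic rows several times, fit Y ~ X + B on each
    # completed dataset of size n + m, and average the coefficient estimates.
    x_all = np.concatenate([x, x_syn])
    y_all = np.concatenate([y, y_syn])
    fits = []
    for _ in range(n_imputations):
        b_syn = d_syn @ imp_coef + rng.normal(0.0, imp_sd, size=m)
        b_all = np.concatenate([b, b_syn])
        design = np.column_stack([np.ones(n + m), x_all, b_all])
        fits.append(ols(design, y_all))
    return np.mean(fits, axis=0)
```

Averaging the coefficient vectors corresponds to the point-estimate part of Rubin's rules only; variance estimation is omitted from this sketch.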