Do you ever find yourself struggling to figure out a question about epidemiologic methods, or other topics in epidemiology, and don’t know who to ask? The SERforum allows for individuals to answer questions that come up in our daily work around substantive and methodological topics in epidemiology.
Multiple sources of imperfect data - how to combine?

Let's say we have two sources of data, medical records and self-report. Many more people have self-report (n=2000) than medical records (n=250), but some have both (n=200). Based on other studies, we expect the medical records data to be good but not perfect, and we're studying something that self-report generally captures reasonably well, though again not perfectly. Comparison of medical records and self-report within the data confirms this: most of the time they agree, but in some cases one or the other record is clearly correct, and some of the time it's not clear which is right.

How best to use the combined data?

a) Create a "best guess" record that uses both sources for those that have it, and only one source for those that don't? (Idea is, maximizes sample size and reduces error)

b) Stick to one source or the other, even if the other indicates an error? (consistent across participants)

c) Use only those for whom both sources are available and they agree or one is clearly better? (minimizes measurement error, but also sample size)

d) Some sort of validation algorithm/regression calibration/adjustment for agreement?

e) Option (a) or (b), plus a sensitivity analysis based on the agreement data?
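
For options (d) and (e), a natural first step is to quantify agreement in the subsample that has both sources. Below is a minimal sketch of Cohen's kappa for a binary variable. The cell counts are hypothetical (chosen only to sum to the n=200 overlap described above), and the function name is made up for illustration. Kappa measures agreement beyond chance; it does not say which source is correct.

```python
def cohens_kappa(both_yes, record_only, self_only, both_no):
    """Cohen's kappa for a 2x2 agreement table (medical record vs. self-report)."""
    n = both_yes + record_only + self_only + both_no
    if n == 0:
        raise ValueError("agreement table is empty")

    # Proportion of people on whom the two sources agree.
    observed = (both_yes + both_no) / n

    # Marginal proportion classified "yes" by each source.
    record_yes = (both_yes + record_only) / n
    self_yes = (both_yes + self_only) / n

    # Agreement expected by chance if the sources were independent.
    expected = record_yes * self_yes + (1 - record_yes) * (1 - self_yes)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (observed - expected) / (1 - expected)


# Hypothetical counts for the 200 people with both sources (not real data).
kappa = cohens_kappa(both_yes=70, record_only=10, self_only=20, both_no=100)
print(f"kappa = {kappa:.3f}")
```

The off-diagonal cells (record_only, self_only) are also the ones a sensitivity analysis would vary, for example by re-running the main analysis with discordant cases assigned to one source and then the other.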

I'd love to hear other opinions, but I am dealing with almost exactly the same situation now and using Bayesian latent class models. They seem to be a good approach in the absence of a gold standard measurement, and can accommodate uneven observations like you describe.

It is unclear what is being collected in these two data sources; is it baseline covariates, exposure status (this is what I am thinking), or outcome data?

Since both sources are imperfect, this does not quite line up with a validation subsample, though it does let you examine where the sources overlap in classification (if this is a classification problem). But you state that even this may be unclear. The approach may also depend on the effect size you are trying to estimate and on whether there is any systematic bias in who has both data sources. If this is a classification problem, it reminds me of a stacked ensemble, where you might use a majority voting method based on both sources. However, you have an even number of sources, so you would want to weight them based on their individual accuracies or tendencies. Ideally this would be done with a holdout dataset, but given your sample size for individuals with both sources, I would guess this may not be an option.
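
The weighted-voting idea can be sketched roughly as follows. This is only an illustration under strong assumptions: a binary variable, errors that are independent across sources, and a single accuracy per source (sensitivity equal to specificity). The accuracy values are placeholders, not estimates from any study, and the function names are hypothetical.

```python
import math


def log_odds_weight(accuracy):
    """Vote weight for a source: log-odds of its assumed accuracy."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    return math.log(accuracy / (1.0 - accuracy))


def weighted_vote(reports, accuracies):
    """Combine binary reports from several sources.

    reports: dict mapping source name -> True, False, or None (missing).
    Returns True/False, or None if no source is available or the vote ties.
    """
    score = 0.0
    for source, value in reports.items():
        if value is None:
            continue  # source not available for this person
        weight = log_odds_weight(accuracies[source])
        score += weight if value else -weight
    if score > 0:
        return True
    if score < 0:
        return False
    return None


# Placeholder accuracies (assumptions for illustration only).
accuracies = {"medical_record": 0.95, "self_report": 0.85}

disagree = weighted_vote({"medical_record": True, "self_report": False}, accuracies)
self_only = weighted_vote({"medical_record": None, "self_report": False}, accuracies)
print(disagree, self_only)
```

One thing the sketch makes visible: with only two sources, every disagreement is resolved in favour of the source with the higher assumed accuracy, so weighted voting by itself reduces to "trust the better source when they conflict". It only becomes more informative if the weights differ by subgroup or by direction of error.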

I would second the possible alignment with Bayesian approaches. Since I am not overly familiar with Bayesian latent class models, what comes to mind, in an exploratory fashion, is the use of repeated but semi-independent screening tools: say you screen subjects with one tool and then a second instrument, updating prior probabilities along the way.
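
As a rough illustration of that sequential-updating idea, here is a minimal sketch of Bayes' rule applied one source at a time. It assumes the two sources are conditionally independent given the true status, which may well not hold in practice, and the prevalence, sensitivity, and specificity values are made-up placeholders. This is not a latent class model; in a latent class model those parameters would be estimated rather than fixed.

```python
def update_probability(prior, positive, sensitivity, specificity):
    """Posterior probability of true status after one source's report (Bayes' rule)."""
    for name, p in (("prior", prior), ("sensitivity", sensitivity),
                    ("specificity", specificity)):
        if not 0.0 < p < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")

    if positive:
        like_true = sensitivity          # P(report yes | truly yes)
        like_false = 1.0 - specificity   # P(report yes | truly no)
    else:
        like_true = 1.0 - sensitivity    # P(report no | truly yes)
        like_false = specificity         # P(report no | truly no)

    numerator = like_true * prior
    return numerator / (numerator + like_false * (1.0 - prior))


# Placeholder values (assumptions, not estimates from any study).
prevalence = 0.30
self_report = {"sensitivity": 0.80, "specificity": 0.90}
medical_record = {"sensitivity": 0.90, "specificity": 0.95}

# Both sources report "yes": update on self-report, then on the medical record.
after_self = update_probability(prevalence, True, **self_report)
after_both = update_probability(after_self, True, **medical_record)
print(f"after self-report: {after_self:.3f}, after both: {after_both:.3f}")
```

For people with only one source, the update simply stops after the first step, which is one way of handling the uneven availability of the two sources.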

Though I feel as though I am rambling a bit and possibly off track, since I am unclear on what data elements are being taken from your unique data sources.