Review of Examine’s vitamin write-ups

There are a lot of vitamins and other supplements in the world, way more than I have time to investigate. Examine has a pretty good reputation for its reports on vitamins and supplements. It would be extremely convenient for me if this reputation were merited. So I asked Martin Bernstoff to spot check some of their reports.

We originally wanted a fairly thorough review of multiple Examine write-ups. Alas, Martin felt the press of grad school after two shallow reviews and had to step back. This is still enough to be useful so we wanted to share, but please keep in mind its limitations. And if you feel motivated to contribute checks of more articles, please reach out to me.

My (Elizabeth’s) tentative conclusion is that it would take tens of hours to beat an Examine general write-up, but they are not complete in either their list of topics or their investigation into individual topics. If a particular effect is important to you, you will still need to do your own research.

Photo credit DALL-E


Vitamin B12

Claim: “The actual rate of deficiency [of B12] is quite variable and it isn’t fully known what it is, but elderly persons (above 65), vegetarians, or those with digestion or intestinal complications are almost always at a higher risk than otherwise healthy and omnivorous youth”

Verdict: True but not well cited. Their citation merely asserts that these groups have shortages rather than providing measurements, but Martin found a meta-analysis making the same claim for vegetarians (the only group he looked for).


Verdict: Very brief. Couldn’t find much on my own. Seems reasonable.

Claim: “Vitamin B12 can be measured in the blood by serum B12 concentrations, which is reproducible and reliable but may not accurately reflect bodily vitamin B12 stores (as low B12 concentrations in plasma or vitamin B12 deficiencies do not always coexist in a reliable manner[19][26][27]) with a predictive value being reported to be as low as 22%”

Verdict: True, the positive predictive value was 22%, but with a negative predictive value of 100% at the chosen threshold. But those are only the numbers at one threshold. To know whether this is good or bad, we’d have to get numbers at different thresholds (or, preferably, a ROC-AUC).
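
To make the point about thresholds concrete, here is a minimal sketch in Python. The serum B12 values and deficiency labels are made up for illustration (they are not data from the cited study), and `predictive_values` is a hypothetical helper. It treats "serum B12 below the threshold" as a positive test and shows how PPV and NPV trade off as the threshold moves.

```python
def predictive_values(values, has_deficiency, threshold):
    """Return (PPV, NPV) when 'value below threshold' counts as a positive test."""
    tp = fp = tn = fn = 0
    for value, deficient in zip(values, has_deficiency):
        positive = value < threshold
        if positive and deficient:
            tp += 1
        elif positive and not deficient:
            fp += 1
        elif not positive and deficient:
            fn += 1
        else:
            tn += 1
    # Guard against thresholds that produce no positives or no negatives.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Made-up serum B12 values (pmol/L) and made-up "true deficiency" labels.
serum_b12 = [110, 130, 150, 170, 190, 210, 240, 280, 320, 400]
deficient = [True, True, False, True, False, False, False, False, False, False]

for threshold in (140, 180, 220):
    ppv, npv = predictive_values(serum_b12, deficient, threshold)
    print(f"threshold {threshold}: PPV={ppv:.3f}, NPV={npv:.3f}")
```

With these made-up numbers, raising the threshold from 140 to 220 drops PPV from 1.0 to 0.5 while NPV rises from 0.875 to 1.0, which is why a single PPV figure says little on its own.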

Claim: B12 supplements can improve depression

Examine reviews a handful of observational studies showing a correlation, but includes no RCTs. This is in spite of there actually being RCTs, like Koning et al. 2016, and a full meta-analysis, neither of which finds an effect.

The lack of effect in RCTs is less damning than it sounds. I (Elizabeth) haven’t checked all of the studies, but the Koning study didn’t confine itself to subjects with low B12 and only tested serum B12 at baseline, not after treatment. So they have ruled out neither “low B12 can cause depression, but so do a lot of other things” nor “B12 can work but they used the wrong form”.

I still find it concerning that Examine didn’t even mention the RCTs, and I don’t have any reason to believe their correlational studies are any better. 

Interactions with pregnancy

Examine cites only one study, on acute lymphoblastic leukemia, which seems a weird choice. Large meta-analyses exist for pre-term birth and low birth weight, which are likely much more important (e.g. Rogne et al. 2016).


They don’t seem to be saying much wrong but the write-up is not nearly as comprehensive as we had hoped. To give Examine its best shot, we decided the next vitamin should be on their best write-up. We tried asking Examine which article they are especially confident in. Unfortunately, whoever handles their public email address didn’t get the point after 3 emails, so Martin made his best guess. 

Vitamin D

Upper respiratory tract infections.

They summarize several studies but miss a very large RCT published in JAMA, the VIDARIS trial. All studies (including the VIDARIS trial) show no effect, so they might’ve considered the matter settled and stopped looking for more trials, which seems reasonable.

Claim: Vitamin D helps premenstrual syndrome

“Most studies have found a decrease in general symptoms when given to women with vitamin D deficiency, some finding notable reductions and some finding small reductions. It’s currently not known why studies differ, and more research is needed”

This summary seemed optimistic after Martin looked into the studies:

  • Abdollahi 2019:
    • No statistically significant differences between groups.
    • The authors highlight statistically significant decreases for a handful of symptoms in the Vitamin D group, but the decrease is similar in magnitude to placebo. Vitamin D and placebo both have 5 outcomes which were statistically significant.
  • Dadkhah 2016:
    • No statistically significant differences between treatment groups
  • Bahrami 2018:
    • No control group
  • Heidari 2019:
    • Marked differences between groups, but absolutely terrible reporting by the authors – they don’t even mention this difference in the abstract. This makes me (Martin) somewhat worried about the results – if they knew what they were doing, they’d focus the abstract on the difference in differences.
  • Tartagni 2015:
    • Appears to show notable differences between groups, but terrible reporting. Tests change relative to baseline (?!), rather than differences in trends or differences in differences.

In conclusion, only the poorest research finds effects – not a great indicator of a promising intervention. But Examine didn’t miss any obvious studies.

Claim: “There is some evidence that vitamin D may improve inflammation and clinical symptoms in COVID-19 patients, but this may not hold true with all dosing regimens. So far, a few studies have shown that high dosages for 8–14 days may work, but a single high dose isn’t likely to have the same benefit.”

The evidence Martin found seems to support their conclusions. They’re missing one relatively large, recent study (De Niet 2022). More importantly, all included studies are about hospital patients given vitamin D after admission, which are useless for determining if Vitamin D is a good preventative, especially because some forms of vitamin D take days to be turned into a useful form in the body. 

  • Murai 2021:
    • The regimen was a single, high dose at admission.
    • No statistically significant differences between groups, all the effect sizes are tiny or non-existent.
  • Sabico 2021:
    • Compares Vitamin D 5000 IU/daily to 1000 IU/daily in hospitalized patients.
    • In the 5000 IU group, they show faster:
      • Time to recovery (6.2 ± 0.8 versus 9.1 ± 0.8; p = 0.039)
      • Time to restoration of taste (11.4 ± 1.0 versus 16.9 ± 1.7; p = 0.035)
        • The Kaplan-Meier Plot looks weird here, though. What happens on day 14?!
    • All symptom durations, except sore throat, were lower in the 5000 IU group.

All analyses were adjusted for age, BMI and type of D vitamin – which is a good thing, because it appears the 5000 IU group was healthier at baseline.

  • Castillo 2020:
    • Huge effect – half of the control group had to go to the ICU, whereas only one person in the intervention group did so (OR 0.02).
    • Nothing apparently wrong, but I’m still highly suspicious of the study:
      • An apparently well-done randomized pilot trial, early on, published in “The Journal of Steroid Biochemistry and Molecular Biology”. Very worrying that it isn’t published somewhere more prestigious.
      • They gave hydroxychloroquine as the “best available treatment”, even though there was no evidence of effect at the time of the study.
      • They call the study “double masked” – I hope this means double-blinded, because otherwise the study is close to worthless since their primary outcomes are based on doctors’ behavior.
      • The follow-up study is still recruiting.
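
For readers unfamiliar with odds ratios, here is a rough sketch of how a figure like the OR 0.02 quoted for Castillo 2020 arises. The group sizes below (50 treated, 26 control) are assumed for illustration and are not taken from this write-up; only "one person in the intervention group" and "half of the control group" come from the text. `odds_ratio` is a hypothetical helper.

```python
def odds_ratio(events_treated, n_treated, events_control, n_control):
    """Odds of the event in the treated group divided by odds in the control group."""
    odds_treated = events_treated / (n_treated - events_treated)
    odds_control = events_control / (n_control - events_control)
    return odds_treated / odds_control


# Assumed group sizes (illustrative): 1 of 50 treated vs 13 of 26 controls in the ICU.
or_icu = odds_ratio(events_treated=1, n_treated=50, events_control=13, n_control=26)
print(f"odds ratio: {or_icu:.3f}")
```

Under those assumed sizes the odds ratio comes out at roughly 0.02. With only one event in the treated group, moving a single patient changes the estimate a lot, which is part of why a small pilot with a huge effect deserves suspicion.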


I don’t know of a better comprehensive resource than Examine. It is alas still not comprehensive enough for important use cases, but still a useful shortcut for smaller problems.

Thanks to the FTX Regrant program for funding this post, and Martin for doing most of the work.

Guesstimate Algorithm for Medical Research

This document is aimed at subcontractors doing medical research for me. I am sharing it in the hope it is more broadly useful, but have made no attempts to make it more widely accessible. 


Guesstimate is a tool I have found quite useful in my work, especially in making medical estimates in environments of high uncertainty. It’s not just that it makes it easy to do calculations incorporating many sources of data; guesstimate renders your thinking much more legible to readers, who can then more productively argue with you about your conclusions. 

The basis of guesstimate is breaking down a question you want an answer to (such as “what is the chance of long covid?”) into subquestions that can be tackled independently. Questions can have numerical answers in the form of a single number, a range, or a formula that references other questions. This allows you to highlight areas of relative certainty and relative uncertainty, to experiment with the importance of different assumptions, and for readers to play with your model and identify differences of opinion while incorporating the parts of your work they agree with.


If you’re not already familiar with guesstimate, please watch this video, which references this model. The video goes over two toy questions to help you familiarize yourself with the interface.

The Algorithm

The following is my basic algorithm for medical questions:

  1. Formalize the question you want an answer to. e.g. what is the risk to me of long covid?
  2. Break that question down into subquestions. The appropriate subquestions vary based on what data is available, and your idea of the correct subquestions is likely to change as you work.
    1. When I was studying long covid last year, I broke it into the following subquestions
      1. What is the risk with baseline covid?
      2. What is the vaccine risk modifier?
      3. What is the strain risk modifier?
      4. What’s the risk modifier for a given individual?
  3. In guesstimate, wire the questions together. For example, if you wanted to know your risk of hospitalization when newly vaccinated in May 2021, you might multiply the previous hospitalization rate by a vaccine modifier. If you don’t know how to do that in guesstimate, watch the video above; it demonstrates it in a lot of detail.
  4. Use literature to fill in answers to subquestions as best you can. Unless the data is very good, these probably include giving ranges and making your best guess as to the shape of the distribution of values.
    1. Provide citations for where you got those numbers. This can be done in the guesstimate commenting interface, but that’s quite clunky. Sometimes it’s better to have a separate document where you lay out your reasoning. 
    2. The reader should be able to go from a particular node in the guesstimate to your reasoning for that node with as little effort as possible.
    3. Guesstimate will use a log-normal distribution by default, but you can change it to uniform or normal if you believe that represents reality better.
  5. Sometimes there are questions that the literature literally can’t answer, or that aren’t worth your time to research rigorously. Make your best guess, and call it out as a separate variable so people can identify it and apply their own best guess.
    1. This includes value judgments, like the value of a day in lockdown relative to a normal day, or how much one hates being sick.
    2. Or the 5-year recovery rate from long covid: no one can literally measure it, and while you could guess from other diseases, the additional precision isn’t necessarily worth the effort.
  6. Final product is both the guesstimate model and a document writing up your sources and reasoning.
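
As a rough illustration of steps 2 to 5 outside of guesstimate, the sketch below wires four subquestions together by multiplication and propagates ranges by sampling. Every number is made up. The helper `lognormal_from_range` is hypothetical, and it assumes a range is read as the 5th to 95th percentile of a log-normal, which is an assumption about how guesstimate treats ranges rather than something this post establishes.

```python
import math
import random
from statistics import NormalDist, median

Z90 = NormalDist().inv_cdf(0.95)  # about 1.645 standard deviations


def lognormal_from_range(low, high, rng):
    """Sample a log-normal whose 5th/95th percentiles are (low, high)."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z90)
    return rng.lognormvariate(mu, sigma)


def simulate_risk(n_samples=20_000, seed=0):
    """Multiply the subquestion answers together, one sample at a time."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        baseline_risk = lognormal_from_range(0.05, 0.20, rng)     # made-up range
        vaccine_modifier = lognormal_from_range(0.3, 0.7, rng)    # made-up; <1 = less risk
        strain_modifier = 1.5     # made-up point estimate; >1 = more risk
        personal_modifier = 1.0   # judgment call, kept separate; 1 = average
        samples.append(
            baseline_risk * vaccine_modifier * strain_modifier * personal_modifier
        )
    return sorted(samples)


samples = simulate_risk()
p5 = samples[int(0.05 * len(samples))]
p50 = median(samples)
p95 = samples[int(0.95 * len(samples))]
print(f"risk: median {p50:.3f}, 90% interval {p5:.3f} to {p95:.3f}")
```

Keeping each modifier as its own named variable mirrors the advice in step 5: a reader who disagrees with one guess can change that line without touching the rest.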

Example: Trading off air quality and covid.

The final model is available here.

Every year California gets forest fires big enough to damage air quality even if you are quite far away, which pushes people indoors. This was mostly okay until covid, which made being indoors costly in various ways too. So how do we trade those off? I was particularly interested in trading off outdoor exercise vs the gym (and if both had been too awful I might have rethought my stance on how boring and unpleasant working out in my tiny apartment is).

What I want to know is the QALY hit from 10 minutes outdoors vs 10 minutes indoors. This depends a lot on the exact air quality and covid details for that particular day, so we’ll need to have variables for that.

For air quality, I used the calculations from this website to turn AQI into cigarettes. I found a cigarette -> micromort converter faster than cigarette -> QALY so I’m just going to use that. This is fine as long as covid and air quality have the same QALY:micromort ratio (unlikely) or if the final answer is clear enough that even large changes in the ratio would not change our decision (judgment call). 

For both values that use outside data I leave a comment with the source, which gives them a comment icon in the upper right corner.

But some people are more susceptible than others due to things like asthma or cancer, so I’ll add a personal modifier.  I’m not attempting to define this well: people with lung issues can make their best guess. They can’t read my mind though, so I’ll make it clear that 1=average and which direction is bad.

Okay how about 10 minutes inside? That depends a lot on local conditions. I could embed those all in my guesstimate, or I could punt to microcovid. I’m not sure if microcovid is still being maintained but I’m very sure I don’t feel like creating new numbers right now, so we’ll just do that. I add a comment with basic instructions.

How about microcovids to micromorts? The first source I found said 10k per infection, which is a suspiciously round number but it will do for now. I divide the micromorts by 1 million, since each microcovid is 1/1,000,000 chance of catching covid.

Readers could just guess their personal covid risk modifier like they do for air quality, or they could use this (pre-vaccine, pre-variant) covid risk calculator from the Economist, so I’ll leave a note for that.

But wait- there are two calculations happening in the microcovids -> micromorts cell, which makes it hard to edit if you disagree with me about the risk of covid. I’m going to move the /1,000,000 to the top cell so it’s easy to edit.
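
The conversion described above can be sketched in a few lines of Python. The 10,000 micromorts per infection is the "suspiciously round" figure from the text; the function names and the 200-microcovid example visit are hypothetical. Splitting the division by 1,000,000 into its own function mirrors moving it into its own cell.

```python
MICROMORTS_PER_INFECTION = 10_000      # first-source figure quoted in the text
MICROCOVIDS_PER_INFECTION = 1_000_000  # one microcovid = 1/1,000,000 chance of covid


def infection_probability(microcovids):
    """Convert a microcovid count into a probability of catching covid."""
    return microcovids / MICROCOVIDS_PER_INFECTION


def covid_micromorts(microcovids,
                     micromorts_per_infection=MICROMORTS_PER_INFECTION,
                     personal_modifier=1.0):
    """Expected micromorts from an activity; personal_modifier 1 = average, >1 = worse."""
    return (infection_probability(microcovids)
            * micromorts_per_infection
            * personal_modifier)


# Hypothetical example: an indoor activity that microcovid rates at 200 microcovids.
gym_visit = covid_micromorts(200)
print(f"about {gym_visit:.2f} micromorts")
```

Someone who disagrees about the fatality risk only needs to pass a different `micromorts_per_infection`, and the unit conversion stays untouched.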

But the risk of catching covid outside isn’t zero. Microcovid says outdoors has 1/20th the risk. I’m very sure that’s out of date but don’t know the new number, so I’ll make something up and list it separately so it’s easy to edit.

But wait- I’m not necessarily with the same people indoors and out. The general density of people is comparable if I’m deciding to throw a party inside or outside, but not if I’m deciding to exercise outdoors or at a gym. So I should make that toggleable.

Eh, I’m still uncomfortable with that completely made up outdoor risk modifier. Let’s make it a range so we can see the scope of possible risks. Note that this only matters if we’re meeting people outdoors, which seems correct.

But that used guesstimate’s default probability distribution (log normal). I don’t see a reason probability density would concentrate at the low end of the distribution, so I switch it to normal.

Turns out to make very little difference in practice.
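
To see what the switch changes in the input itself, the sketch below compares a log-normal and a normal fitted to the same range. The range 0.01 to 0.10 is a hypothetical stand-in for the outdoor risk modifier, and reading a range as the 5th to 95th percentile is an assumption about guesstimate's behaviour, not something taken from the model.

```python
import math
from statistics import NormalDist

Z90 = NormalDist().inv_cdf(0.95)  # about 1.645 standard deviations


def lognormal_summary(low, high):
    """Median and mean of a log-normal with 5th/95th percentiles (low, high)."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z90)
    return {"median": math.exp(mu), "mean": math.exp(mu + sigma ** 2 / 2)}


def normal_summary(low, high):
    """Median and mean of a normal with 5th/95th percentiles (low, high)."""
    center = (low + high) / 2
    return {"median": center, "mean": center}


low, high = 0.01, 0.10  # hypothetical range for the outdoor risk modifier
log_normal = lognormal_summary(low, high)
normal = normal_summary(low, high)
print(f"log-normal: median {log_normal['median']:.4f}, mean {log_normal['mean']:.4f}")
print(f"normal:     median {normal['median']:.4f}, mean {normal['mean']:.4f}")
```

The log-normal puts its median at the geometric midpoint of the range, below the arithmetic midpoint the normal uses. Whether that shift matters depends on how much weight the node carries in the final answer.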

There are still a few problems here. Some of the numbers are more or less made up, and others have sources but I’ve done no work to verify them, which is almost as bad.

But unless the numbers are very off, covid is a full order of magnitude riskier than air pollution for the scenarios I picked. This makes me disinclined to spend a bunch of time tracking down better numbers.

Full list of limitations:

  • Only looks at micromorts, not QALYs
  • Individual adjustment basically made up, especially for pollution
  • Several numbers completely made up
  • Didn’t check any of my sources

Example: Individual’s chance of long covid given infection

This will be based on my post last year, Long covid is not necessarily your biggest problem, with some modification for pedagogical purposes. It uses made up numbers instead of real ones because the specific numbers have long been eclipsed by new data and strains. The final model is available here.

Step one is to break your questions into subquestions. When I made this model a year ago, we only had data for baseline covid in unvaccinated people. Everyone wanted to know how vaccinations and the new strain would affect things. 

My first question was “can we predict long covid from acute covid?” I dug into the data and concluded “Yes”, the risk of long covid seemed to be very well correlated with acute severity. This informed the shape of the model but not any particular values. Some people disagreed with me, and they would make a very different model. 

Once I made that determination, the model was pretty easy to create: It looked like [risk of hospitalization with baseline covid] * [risk of long covid given hospitalization rate] * [vaccination risk modifier] * [strain hospitalization modifier] * [personal risk modifier]. Note that the model I’m creating here does not perfectly match the one created last year; I’ve modified it to be a better teaching example. 

The risk of hospitalization is easy to establish unless you start including undetected/asymptomatic cases. This has become a bigger deal as home tests became more available and mild cases became more common, since government statistics are missing more mild or asymptomatic cases. So in my calculation, I broke down the risk of hospitalization given covid to the known case hospitalization rate and then inserted a separate term based on my estimate of the number of uncaught cases. In the original post I chose some example people and used base estimates for them from the Economist data. In this model, I made something up.
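
One way to express that separate term is sketched below. It assumes, for illustration only, that uncaught cases never end up hospitalized, so they dilute the known-case rate; the function name and the example numbers are made up.

```python
def hospitalization_risk_per_infection(known_case_rate, uncaught_per_caught):
    """Dilute the known-case hospitalization rate by undetected infections.

    Assumes every hospitalized case is detected, so uncaught cases only
    add to the denominator.
    """
    detected_fraction = 1 / (1 + uncaught_per_caught)
    return known_case_rate * detected_fraction


# Made-up inputs: 5% of known cases hospitalized, 1.5 uncaught cases per caught case.
risk = hospitalization_risk_per_infection(known_case_rate=0.05, uncaught_per_caught=1.5)
print(f"hospitalization risk per infection: {risk:.3f}")
```

Keeping `uncaught_per_caught` as its own input makes it easy to widen into a range as home testing makes the estimate less certain.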

Honestly, I don’t remember how I calculated the risk of long covid given the hospitalization rate. It was very complicated and a long time ago. This is why I write companion documents to explain my reasoning. 

The vaccination modifier was quite easy; every scientist was eager to tell us that. However, there are now questions about vaccines waning over time, and an individual’s protection level is likely to vary. Because of that, in this test model I have entered a range of vaccine efficacies, rather than a point estimate. An individual who knew how recently they were vaccinated might choose to collapse that down.

Similarly, strain hospitalization modifiers take some time to assess, but are eventually straightforwardly available. Your estimate early in a new strain will probably have a much wider confidence interval than your estimate late in the same wave. 

By definition, I can’t set the personal risk modifier for every person looking at the model. I suggested people get a more accurate estimate of their personal risk using the Economist calculator, and then enter that in the model.

Lastly, there is a factor I called “are you feeling lucky?”. Some people don’t have anything diagnosable but know they get every cold twice; other people could get bitten by a plague rat with no ill effects. This is even more impossible to provide for an individual but is in fact pretty important for an individual’s risk assessment, so I included it as a term in the model. Individuals using the model can set it as they see fit, including to 1 if they don’t want to think about it.

When I put this together, I get this guesstimate. Remember the numbers are completely made up. If you follow the link you can play around with it yourself, but your changes will not be saved. If anyone wants to update my model with modern strains and vaccine efficacy, I would be delighted.

Tips and Tricks

I’m undoubtedly missing many, so please comment with your own and I’ll update or create a new version later.

When working with modifiers, it’s easy to forget whether a large number is good or bad, and what the acceptable range is. It can be good to mark them with “0 to 1, higher is less risky”, or “between 0 and 1 = less risk, >1 = more risk”.

If you enter a range, the default distribution is log-normal. If you want something different, change it. 

The formulas in the cells can get almost arbitrarily complicated, although it’s often not worth it. 

No, seriously, write out your sources and reasoning somewhere else because you will come back in six months and not remember what the hell you were thinking. Guesstimate is less a tool for holding your entire model and more a tool for forcing you to make your model explicit.

Separate judgment calls from empirical data, even if you’re really sure you are right. 


Thanks to Ozzie Gooen and his team for creating Guesstimate.

Thanks to the FTX Regrant program and a shy regrantor for funding this work.