The Balto/Togo theory of scientific development

Tragically I gave up on the Plate Tectonics study before answering my most important question: “Is Alfred Wegener the Balto of plate tectonics?”

Let me back up.


Balto is a famous sled dog. He got a statue in NYC for leading a team of dogs through a blizzard to deliver antitoxin serum to Nome, Alaska in 1925, ending a diphtheria outbreak. Later Universal made an animated movie about how great he was.

Except that run was a relay, and Balto only got famous because he ran the last leg, which had the most press coverage but was also the easiest. The real hero was Togo, the dog who led the team through the hardest terrain and covered by far the most miles as well. Disney later made a movie about him that makes no mention of Balto for the first 90%, and then goes out of its way to say what a shit dog Balto was: that’s why he wasn’t included in any of the important teams, but Togo had had to do so many hard things that they needed a backup team for the trivial last leg, so Balto would have to do.

Togo’s owner died mad that the US mainland believed Balto was a hero. But since all the breeders knew who did the hard part, Togo enjoyed a post-Nome level of reproductive success that Genghis Khan could only dream of, so I feel like he was happy with his choices.

Plus he did eventually get some statues.

But it’s not like Togo did this alone either. He led one team in a relay, and there were 20 humans and 150 dogs that contributed to the overall run. Plus someone had to invent the serum, manufacture it, and get it to the start of the dog relay at Nenana, Alaska. So exactly how much credit should Togo get here?

The part with Wegener

I was pretty sure Alfred Wegener, popularly credited as the discoverer/inventor of continental drift and mentioned more prominently than any other scientist in discussions of plate tectonics, is a Balto.

First of all, continental drift is not plate tectonics. Continental drift is an idea that maybe some stuff happened one time. Plate tectonics is a paradigm with a mechanism that makes predictions and explains a lot of data no one knew was related until that moment.

Second, Wegener didn’t discover any of the evidence he cited, he wasn’t the first to have the idea, and it’s not even clear he did much of the synthesis of the evidence. His original paper notes that “Concerning South America and Africa, biologists and geologists are in close agreement that a Brazilian–African continent existed in the Mesozoic”.

So he didn’t invent the idea, gather the data, or even really synthesize the evidence. His guess at the mechanism was wrong. But despite spending hours digging into the specific discoverers and synthesizers that contributed to plate tectonics, the only name I remember is Wegener’s. Classic Balto.

On the other hand, some of the people who gathered the data used to discover plate tectonics were motivated by the concept of continental drift, and by Wegener specifically. That seems like it should count for something. My collaborator Jasen Murray thinks it counts for a lot.

Jasen would go so far as to argue that shining a beacon in unknown territory that inspires explorers to look for treasure in the right place makes you the Togo, racing through fractured ice, rapids, social ridicule, and self-doubt to do the real work of getting an idea considered at all. Showing up at the finish line to formalize a theory after there’s enough work to know it’s true is Balto work to him. This makes me profoundly uncomfortable because strongly advocating for something unproven terrifies me, but as counterarguments go that’s pretty weak.

One difficulty is that it’s hard to distinguish “ahead of their time beacon shining” from “lucky idiot”, and even Jasen admits he doesn’t know enough to claim that Wegener in particular is a Togo. But doing work that is harder to credit because it’s less legible is also very Togo-like behavior, so this proves nothing about the category.

So I guess one of my new research questions is “how important are popularizers?” and I hate it.

Dependency Tree For The Development Of Plate Tectonics

This post is really rough and mostly meant to refer back to when I’ve produced more work on the subject. Proceed at your own risk.


As I mentioned a few weeks ago I am working on a project on how scientific paradigms are developed. I generated a long list of questions and picked plate tectonics as my first case study. I immediately lost interest in the original questions and wanted to make a dependency graph/tech tree for the development of the paradigm, and this is just a personal project so I did that instead.

I didn’t reach a stopping point with this graph other than “I felt done and wanted to start on my second case study”. I’m inconsistent about the level of detail or how far back I go. I tried to go back and mark whether data collection was motivated by theory or practical issues but didn’t fill it in for every node, even when it was knowable. Working on a second case study felt more useful than refining this one further so I’m shipping this version. 

“Screw it, I’m shipping” is kind of the theme of this blog post, but that’s partially because I’m not sure which things are most valuable. Questions, suggestions, or additions are extremely welcome, as they help me narrow in on the important parts. But heads up: the answer might be “I don’t remember and don’t think it’s important enough to look up”. My current intention is to circle back after 1 or 2 more case studies and do some useful compare and contrast, but maybe I’ll find something better.

(Readable version here)

And if you’re really masochistic, here’s the yEd file to play with.

Scattered Thoughts

Why I chose plate tectonics

  • It’s recent enough to have relatively good documentation, but not so recent the major players are alive and active in field politics.
  • It’s not a sexy topic, so while there isn’t much work on it, what exists is pretty high quality.
  • It is *the only* accepted paradigm in its field (for the implicit definition of paradigm in my head).
  • Most paradigms are credited to one person on Wikipedia, even though that one person needed many other people’s work and the idea was refined by many people after they created it. Plate tectonics is the first I’ve found that didn’t do that. Continental drift is attributed to Alfred Wegener, but continental drift is not plate tectonics. Plate tectonics is acknowledged as so much of a group effort that Wikipedia doesn’t give anyone’s name.

Content notes

  • This graph is based primarily on Plate Tectonics: An Insider’s History of the Modern Theory of the Earth, edited by Naomi Oreskes. It also includes parts from this lecture by Christopher White, and Oxford’s Very Short Introduction to Plate Tectonics.
  • Sources vary on how much Alfred Wegener knew when he proposed continental drift. Some say he only had the fossil and continental shape data, but the White video says he had also used synchronous geological layers and evidence of glacial travel.
    • I tried to resolve this by reading Wegener’s original paper (translated into English) but it only left me more confused. He predicted cracks in plates being filled in by magma, but only mentions fossils once. Moreover he only brings them up to point to fossils of plants that are clearly maladapted to the climate of their current location, not the transcontinental weirdnesses. He does casually mention “Concerning South America and Africa, biologists and geologists are in close agreement that a Brazilian–African continent existed in the Mesozoic”, but clearly he’s not the first one to make that argument.
    • I alas ran out of steam before trying Wegener’s book.
    • I was stymied in attempts to check his references by the fact that they’re in German. If you really love reading historic academic German and would like to pair on this, please let me know.
    • I stuck to just the fossil + fit data in the graph, because White is ambiguous when he’s talking about data Wegener had vs. data that came later.
    • White says the bathymetry maps, which showed that the continental shelves fit much better than the continents themselves, didn’t come out until after Wegener had published, but this paper cites sufficiently detailed maps of North America’s sea floor in 1884. It’s possible no one bothered with South America and Africa until later.
  • A lot of the data for plate tectonics fell out of military oceanography research. Some of the tools used for this were 100+ years old. Others were recently invented (in particular, magnetometers and gravimeters that worked at sea), but the tech those inventions relied on was not that recent. I think. It’s possible a motivated person could have gathered all the necessary evidence much earlier.
  • Sources also vary a lot on what they thought was relevant. The White video uses continental shelf fit (which is much more precise than using the visible coastline) as one of the five pillars of evidence, but it didn’t come up in the overview chapter of the Oreskes book at all.
  • This may be because evidence of continental drift (that is, that the continents used to be in different places, sometimes touching each other) is very different from evidence for plate tectonics (which overwhelmingly focuses on the structure of the plates and mechanism of motion).

Process notes

  • At points my research got very bogged down in some of the specifics of plate tectonics (in particular, why were transform faults always shown perpendicular to mountain ridges, and how there could be so many parallel to each other?). This ended up being quite time consuming because I was in that dead zone where the question was too advanced for 101 resources to answer but advanced resources assumed you already knew the answer. In the end I had to find a human tutor.
  • This could clearly be infinitely detailed or go infinitely far back. I didn’t have a natural “done” condition beyond feeling bored and wanting to do something else. 
  • I only got two chapters into Oreskes and ⅔ through Very Short Introduction. 
  • I didn’t keep close track but this probably represents 20 hours of work, maybe closer to 30 with a more liberal definition of work. Plus 5-10 hours from other people.
  • In calendar time it was ~7 weeks from starting the Oreskes book to scheduling this for publishing.
  • You can see earlier drafts of the graph, along with some of my notes, on Twitter.


Thanks to several friends and especially Jasen Murray for their suggestions and questions, and half the people I’ve talked to in the last six weeks for tolerating this topic.

Thanks to Emily Arnold for spending an hour answering my very poorly phrased questions about transform faults.

Thanks to my Patreon patrons for supporting this work, you guys get a fractional impact share.

Review of Examine.com’s vitamin write-ups

There are a lot of vitamins and other supplements in the world, way more than I have time to investigate. Examine.com has a pretty good reputation for its reports on vitamins and supplements. It would be extremely convenient for me if this reputation were merited. So I asked Martin Bernstoff to spot-check some of their reports.

We originally wanted a fairly thorough review of multiple Examine write-ups. Alas, Martin felt the press of grad school after two shallow reviews and had to step back. This is still enough to be useful so we wanted to share, but please keep in mind its limitations. And if you feel motivated to contribute checks of more articles, please reach out to me.

My (Elizabeth’s) tentative conclusion is that it would take tens of hours to beat an Examine general write-up, but they are complete in neither their list of topics nor their investigation into individual topics. If a particular effect is important to you, you will still need to do your own research.

Photo credit DALL-E


Vitamin B12

Claim: “The actual rate of deficiency [of B12] is quite variable and it isn’t fully known what it is, but elderly persons (above 65), vegetarians, or those with digestion or intestinal complications are almost always at a higher risk than otherwise healthy and omnivorous youth”

Verdict: True but not well cited. Their citation merely asserts that these groups have shortages rather than providing measurements, but Martin found a meta-analysis making the same claim for vegetarians (the only group he looked for).


Verdict: Very brief. Couldn’t find much on my own. Seems reasonable.

Claim: “Vitamin B12 can be measured in the blood by serum B12 concentrations, which is reproducible and reliable but may not accurately reflect bodily vitamin B12 stores (as low B12 concentrations in plasma or vitamin B12 deficiencies do not always coexist in a reliable manner[19][26][27]) with a predictive value being reported to be as low as 22%”

Verdict: True, the positive predictive value was 22%, but with a negative predictive value of 100% at the chosen threshold. But that’s only the numbers at one threshold. To know whether this is good or bad, we’d have to get numbers at different thresholds (or, preferably, a ROC-AUC).

Claim: B12 supplements can improve depression

Examine reviews a handful of observational studies showing a correlation, but includes no RCTs. This is in spite of there actually being RCTs, like Koning et al. 2016, and a full meta-analysis, neither of which finds an effect.

The lack of effect in RCTs is less damning than it sounds. I (Elizabeth) haven’t checked all of the studies, but the Koning study didn’t confine itself to subjects with low B12 and only tested serum B12 at baseline, not after treatment. So they have ruled out neither “low B12 can cause depression, but so do a lot of other things” nor “B12 can work but they used the wrong form”.

I still find it concerning that Examine didn’t even mention the RCTs, and I don’t have any reason to believe their correlational studies are any better. 

Interactions with pregnancy

Examine cites only one study, on acute lymphoblastic leukemia, which seems a weird choice. Large meta-analyses exist for pre-term birth and low birth weight (e.g. Rogne et al. 2016), which are likely much more important outcomes.


They don’t seem to be saying much wrong but the write-up is not nearly as comprehensive as we had hoped. To give Examine its best shot, we decided the next vitamin should be on their best write-up. We tried asking Examine which article they are especially confident in. Unfortunately, whoever handles their public email address didn’t get the point after 3 emails, so Martin made his best guess. 

Vitamin D

Upper respiratory tract infections.

They summarize several studies but miss a very large RCT published in JAMA, the VIDARIS trial. All studies (including the VIDARIS trial) show no effect, so they might’ve considered the matter settled and stopped looking for more trials, which seems reasonable.

Claim: Vitamin D helps premenstrual syndrome

“Most studies have found a decrease in general symptoms when given to women with vitamin D deficiency, some finding notable reductions and some finding small reductions. It’s currently not known why studies differ, and more research is needed”

This summary seemed optimistic after Martin looked into the studies:

  • Abdollahi 2019:
    • No statistically significant differences between groups.
    • The authors highlight statistically significant decreases for a handful of symptoms in the vitamin D group, but the decreases are similar in magnitude to placebo’s. Vitamin D and placebo both had 5 outcomes that were statistically significant.
  • Dadkhah 2016:
    • No statistically significant differences between treatment groups
  • Bahrami 2018:
    • No control group
  • Heidari 2019:
    • Marked differences between groups, but absolutely terrible reporting by the authors – they don’t even mention this difference in the abstract. This makes me (Martin) somewhat worried about the results – if they knew what they were doing, they’d focus the abstract on the difference in differences.
  • Tartagni 2015:
    • Appears to show notable differences between groups, but terrible reporting. Tests change relative to baseline (?!), rather than differences in trends or differences in differences.

In conclusion, only the poorest research finds effects – not a great indicator of a promising intervention. But Examine didn’t miss any obvious studies.

Claim: “There is some evidence that vitamin D may improve inflammation and clinical symptoms in COVID-19 patients, but this may not hold true with all dosing regimens. So far, a few studies have shown that high dosages for 8–14 days may work, but a single high dose isn’t likely to have the same benefit.”

The evidence Martin found seems to support their conclusions. They’re missing one relatively large, recent study (De Niet 2022). More importantly, all included studies are about hospital patients given vitamin D after admission, which makes them useless for determining whether vitamin D is a good preventative, especially because some forms of vitamin D take days to be converted into a useful form in the body.

  • Murai 2021:
    • The regimen was a single, high dose at admission.
    • No statistically significant differences between groups, all the effect sizes are tiny or non-existent.
  • Sabico 2021:
    • Compares Vitamin D 5000 IU/daily to 1000 IU/daily in hospitalized patients.
    • In the Vitamin D group, they show faster
      • Time to recovery (6.2 ± 0.8 versus 9.1 ± 0.8; p = 0.039)
      • Time to restoration of taste (11.4 ± 1.0 versus 16.9 ± 1.7; p = 0.035)
        • The Kaplan-Meier Plot looks weird here, though. What happens on day 14?!
    • All symptom durations, except sore throat, were lower in the 5000 IU group.

All analyses were adjusted for age, BMI, and type of vitamin D – which is a good thing, because it appears the 5000 IU group was healthier at baseline.

  • Castillo 2020:
    • Huge effect – half of the control group had to go to the ICU, whereas only one person in the intervention group did so (OR 0.02).
    • Nothing apparently wrong, but I’m still highly suspicious of the study:
      • An apparently well-done randomized pilot trial, early on, published in “The Journal of Steroid Biochemistry and Molecular Biology”. Very worrying that it isn’t published somewhere more prestigious.
      • They gave hydroxychloroquine as the “best available treatment”, even though there was no evidence of effect at the time of the study.
      • They call the study “double masked” – I hope this means double-blinded, because otherwise the study is close to worthless since their primary outcomes are based on doctor’s behavior.
      • The follow-up study is still recruiting.


I don’t know of a better comprehensive resource than Examine.com. It is alas still not comprehensive enough for important use cases, but it’s still a useful shortcut for smaller problems.

Thanks to the FTX Regrant program for funding this post, and Martin for doing most of the work.

Guesstimate Algorithm for Medical Research

This document is aimed at subcontractors doing medical research for me. I am sharing it in the hope it is more broadly useful, but have made no attempts to make it more widely accessible. 


Guesstimate is a tool I have found quite useful in my work, especially in making medical estimates in environments of high uncertainty. It’s not just that it makes it easy to do calculations incorporating many sources of data; Guesstimate renders your thinking much more legible to readers, who can then more productively argue with you about your conclusions.

The basis of guesstimate is breaking down a question you want an answer to (such as “what is the chance of long covid?”) into subquestions that can be tackled independently. Questions can have numerical answers in the form of a single number, a range, or a formula that references other questions. This allows you to highlight areas of relative certainty and relative uncertainty, to experiment with the importance of different assumptions, and for readers to play with your model and identify differences of opinion while incorporating the parts of your work they agree with.


If you’re not already familiar with guesstimate, please watch this video, which references this model. The video goes over two toy questions to help you familiarize yourself with the interface.

The Algorithm

The following is my basic algorithm for medical questions:

  1. Formalize the question you want an answer to. e.g. what is the risk to me of long covid?
  2. Break that question down into subquestions. The appropriate subquestion varies based on what data is available, and your idea of the correct subquestions is likely to change as you work.
    1. When I was studying long covid last year, I broke it into the following subquestions
      1. What is the risk with baseline covid?
      2. What is the vaccine risk modifier?
      3. What is the strain risk modifier?
      4. What’s the risk modifier for a given individual?
  3. In guesstimate, wire the questions together. For example, if you wanted to know your risk of hospitalization when newly vaccinated in May 2021, you might multiply the former hospitalization rate times a vaccine modifier. If you don’t know how to do that in guesstimate, watch the video above, it demonstrates it in a lot of detail.
  4. Use literature to fill in answers to subquestions as best you can. Unless the data is very good, this probably includes giving ranges and making your best guess as to the shape of the distribution of values.
    1. Provide citations for where you got those numbers. This can be done in the guesstimate commenting interface, but that’s quite clunky. Sometimes it’s better to have a separate document where you lay out your reasoning. 
    2. The reader should be able to go from a particular node in the guesstimate to your reasoning for that node with as little effort as possible.
    3. Guesstimate will use a log-normal distribution by default, but you can change it to uniform or normal if you believe that represents reality better.
  5. Sometimes there are questions literature literally can’t answer, or aren’t worth your time to research rigorously. Make your best guess, and call it out as a separate variable so people can identify it and apply their own best guess.
    1. This includes value judgments, like the value of a day in lockdown relative to a normal day, or how much one hates being sick.
    2. Or the 5-year recovery rate from long covid: no one can literally measure it, and while you could guess from other diseases, the additional precision isn’t necessarily worth the effort.
  6. Final product is both the guesstimate model and a document writing up your sources and reasoning.
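
The steps above can be sketched outside guesstimate too. Here is a minimal Monte Carlo version in Python, with every number invented purely for illustration (guesstimate does essentially this sampling for you behind each cell):

```python
import math
import random

random.seed(0)  # reproducible sketch

def lognormal_range(low, high, n=10_000):
    """Sample a log-normal whose 90% interval is roughly (low, high)."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * 1.645)
    return [random.lognormvariate(mu, sigma) for _ in range(n)]

# Step 2: subquestions become independent variables (all numbers made up):
baseline_risk = lognormal_range(0.01, 0.05)  # risk with baseline covid
vaccine_mod   = lognormal_range(0.1, 0.4)    # vaccine risk modifier
strain_mod    = lognormal_range(0.8, 2.0)    # strain risk modifier
personal_mod  = 1.0                          # 1 = average; set your own

# Step 3: wire the questions together by multiplying, sample by sample:
risk = sorted(b * v * s * personal_mod
              for b, v, s in zip(baseline_risk, vaccine_mod, strain_mod))

print(f"median risk: {risk[len(risk) // 2]:.4f}")
print(f"90% interval: {risk[500]:.4f} to {risk[9500]:.4f}")
```

The point is not the numbers, which are fake, but that each subquestion stays a separate, arguable variable.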

Example: Trading off air quality and covid.

The final model is available here.

Every year California gets forest fires big enough to damage air quality even if you are quite far away, which pushes people indoors. This was mostly okay until covid, which made being indoors costly in various ways too. So how do we trade those off? I was particularly interested in trading off outdoor exercise vs the gym (and if both had been too awful I might have rethought my stance on how boring and unpleasant working out in my tiny apartment is).

What I want to know is the QALY hit from 10 minutes outdoors vs 10 minutes indoors. This depends a lot on the exact air quality and covid details for that particular day, so we’ll need to have variables for that.

For air quality, I used the calculations from this website to turn AQI into cigarettes. I found a cigarette -> micromort converter faster than cigarette -> QALY so I’m just going to use that. This is fine as long as covid and air quality have the same QALY:micromort ratio (unlikely) or if the final answer is clear enough that even large changes in the ratio would not change our decision (judgment call). 
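
For concreteness, here is a back-of-envelope version of that conversion chain in Python. The constants are common rules of thumb, not necessarily the numbers from the sources I used, so treat this as a hedged sketch:

```python
# Rule-of-thumb constants (assumptions, not the post's actual sources):
PM25_PER_CIGARETTE = 22.0       # PM2.5 of ~22 µg/m³ sustained for 24h ≈ one cigarette
CIGARETTES_PER_MICROMORT = 1.4  # classic micromort-table figure for smoking

def micromorts_from_pm25(pm25_ugm3, hours_exposed):
    """Convert a PM2.5 exposure into micromorts via the cigarette equivalence."""
    cigarettes = (pm25_ugm3 / PM25_PER_CIGARETTE) * (hours_exposed / 24)
    return cigarettes / CIGARETTES_PER_MICROMORT

# e.g. an hour outside at PM2.5 of 150 (bad wildfire smoke):
print(round(micromorts_from_pm25(150, 1), 3))  # → 0.203
```
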

For both values that use outside data I leave a comment with the source, which gives them a comment icon in the upper right corner.

But some people are more susceptible than others due to things like asthma or cancer, so I’ll add a personal modifier.  I’m not attempting to define this well: people with lung issues can make their best guess. They can’t read my mind though, so I’ll make it clear that 1=average and which direction is bad.

Okay how about 10 minutes inside? That depends a lot on local conditions. I could embed those all in my guesstimate, or I could punt to microcovid. I’m not sure if microcovid is still being maintained but I’m very sure I don’t feel like creating new numbers right now, so we’ll just do that. I add a comment with basic instructions.

How about microcovids to micromorts? The first source I found said 10k per infection, which is a suspiciously round number but it will do for now. I divide the micromorts by 1 million, since each microcovid is a 1/1,000,000 chance of catching covid.
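
In code, the calculation in that cell is just (using the suspiciously round 10k figure):

```python
MICROMORTS_PER_INFECTION = 10_000  # suspiciously round number from my first source

def microcovids_to_micromorts(microcovids):
    # each microcovid is a 1/1,000,000 chance of catching covid
    p_infection = microcovids / 1_000_000
    return p_infection * MICROMORTS_PER_INFECTION

print(microcovids_to_micromorts(500))  # 500 microcovids → 5.0 micromorts
```
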

Readers could just guess their personal risk modifier like they do for air quality, or they could use this (pre-vaccine, pre-variant) covid risk calculator from the Economist, so I’ll leave a note for that.

But wait: there are two calculations happening in the microcovids -> micromorts cell, which makes it hard to edit if you disagree with me about the risk of covid. I’m going to move the /1,000,000 to the top cell so it’s easy to edit.

But the risk of catching covid outside isn’t zero. Microcovid says outdoors has 1/20th the risk. I’m very sure that’s out of date but don’t know the new number, so I’ll make something up and list it separately so it’s easy to edit.

But wait: I’m not necessarily with the same people indoors and out. The general density of people is comparable if I’m deciding to throw a party inside or outside, but not if I’m deciding to exercise outdoors or at a gym. So I should make that toggleable.

Eh, I’m still uncomfortable with that completely made up outdoor risk modifier. Let’s make it a range so we can see the scope of possible risks. Note that this only matters if we’re meeting people outdoors, which seems correct.

But that used guesstimate’s default probability distribution (log normal). I don’t see a reason probability density would concentrate at the low end of the distribution, so I switch it to normal.

Turns out to make very little difference in practice.

There are still a few problems here. Some of the numbers are more or less made up, and others have sources but I’ve done no work to verify them, which is almost as bad.

But unless the numbers are very off, covid is a full order of magnitude riskier than air pollution for the scenarios I picked. This makes me disinclined to spend a bunch of time tracking down better numbers.

Full list of limitations:

  • Only looks at micromorts, not QALYs
  • Individual adjustment basically made up, especially for pollution
  • Several numbers completely made up
  • Didn’t check any of my sources

Example: Individual’s chance of long covid given infection

This will be based on my post last year, Long covid is not necessarily your biggest problem, with some modification for pedagogical purposes, and with made-up numbers instead of real ones, because the specific numbers have long been eclipsed by new data and strains. The final model is available here.

Step one is to break your questions into subquestions. When I made this model a year ago, we only had data for baseline covid in unvaccinated people. Everyone wanted to know how vaccinations and the new strain would affect things. 

My first question was “can we predict long covid from acute covid?” I dug into the data and concluded “Yes”, the risk of long covid seemed to be very well correlated with acute severity. This informed the shape of the model but not any particular values. Some people disagreed with me, and they would make a very different model. 

Once I made that determination, the model was pretty easy to create: It looked like [risk of hospitalization with baseline covid] * [risk of long covid given hospitalization rate] * [vaccination risk modifier] * [strain hospitalization modifier] * [personal risk modifier]. Note that the model I’m creating here does not perfectly match the one created last year; I’ve modified it to be a better teaching example. 

The risk of hospitalization is easy to establish unless you start including undetected/asymptomatic cases. This has become a bigger deal as home tests became more available and mild cases became more common, since government statistics are missing more mild or asymptomatic cases. So in my calculation, I broke down the risk of hospitalization given covid to the known case hospitalization rate and then inserted a separate term based on my estimate of the number of uncaught cases. In the original post I chose some example people and used base estimates for them from the Economist data. In this model, I made something up.
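
That decomposition is simple to write down. A sketch with invented numbers (not my real estimates, which are in the original post):

```python
# Decomposing hospitalization risk to account for uncaught cases
# (both numbers below are invented for illustration):
known_case_hosp_rate = 0.03         # hospitalizations per *detected* case
infections_per_detected_case = 2.0  # made-up undercount estimate

# hospitalization risk per infection, including uncaught cases:
hosp_risk_per_infection = known_case_hosp_rate / infections_per_detected_case
print(hosp_risk_per_infection)  # → 0.015
```

The undercount term sits in its own cell precisely so a reader who disagrees with it can change it without touching the case data.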

Honestly, I don’t remember how I calculated the risk of long covid given the hospitalization rate. It was very complicated and a long time ago. This is why I write companion documents to explain my reasoning. 

Vaccination modifier was quite easy, every scientist was eager to tell us that. However, there are now questions about vaccines waning over time, and an individual’s protection level is likely to vary. Because of that, in this test model I have entered a range of vaccine efficacies, rather than a point estimate. An individual who knew how recently they were vaccinated might choose to collapse that down. 

Similarly, strain hospitalization modifiers take some time to assess, but are eventually straightforwardly available. Your estimate early in a new strain will probably have a much wider confidence interval than your estimate late in the same wave. 

By definition, I can’t set the personal risk modifier for every person looking at the model. I suggested people get a more accurate estimate of their personal risk using the Economist calculator, and then enter that in the model.

Lastly, there is a factor I called “are you feeling lucky?”. Some people don’t have anything diagnosable but know they get every cold twice; other people could get bitten by a plague rat with no ill effects. This is even more impossible to provide for an individual but is in fact pretty important for an individual’s risk assessment, so I included it as a term in the model. Individuals using the model can set it as they see fit, including to 1 if they don’t want to think about it.

When I put this together, I get this guesstimate. [#TODO screenshot]. Remember the numbers are completely made up. If you follow the link you can play around with it yourself, but your changes will not be saved. If anyone wants to update my model with modern strains and vaccine efficacy, I would be delighted.

Tips and Tricks

I’m undoubtedly missing many, so please comment with your own and I’ll update or create a new version later.

When working with modifiers, it’s easy to forget whether a large number is good or bad, and what the acceptable range is. It can be good to mark them with “0 to 1, higher is less risky”, or “between 0 and 1 = less risk, >1 = more risk”

If you enter a range, the default distribution is log-normal. If you want something different, change it. 

The formulas in the cells can get almost arbitrarily complicated, although it’s often not worth it. 

No, seriously, write out your sources and reasoning somewhere else because you will come back in six months and not remember what the hell you were thinking. Guesstimate is less a tool for holding your entire model and more a tool for forcing you to make your model explicit.

Separate judgment calls from empirical data, even if you’re really sure you are right. 


Thanks to Ozzie Gooen and his team for creating Guesstimate.

Thanks to the FTX Regrant program and a shy regrantor for funding this work.

Impact Shares For Speculative Projects


Recently I founded a new project with Jasen Murray, a close friend of several years. At founding, the project was extremely amorphous (“preparadigmatic science: how does it work?”) and was going to exit that state slowly, if at all. This made it a bad fit for traditional “apply for a grant, receive money, do work” style funding. The obvious answer is impact certificates, but the current state of the art there wasn’t an easy fit either. In addition to the object-level project, I’m interested in advancing the social tech of funding. With that in mind, Jasen and I negotiated a new system for allocating credit and funding.

This system is extremely experimental, so we have chosen not to make it binding. If we decide to do something different in a few months or a few years, we do not consider ourselves to have broken any promises. 

In the interest of advancing the overall tech, I wanted to share the considerations we have thought of and tentative conclusions. 

DALL-E rendering of impact shares


All of the following made traditional grant-based funding a bad fit:

  • Our project is currently very speculative and its outcomes are poorly defined. I expect it to be still speculative but at least a little more defined in a few months.
  • I have something that could be called integrity, or could be called scrupulosity issues, which makes me feel strongly bound to follow plans I have written down and people have paid me for, to the point it can corrupt my epistemics. This makes accepting money while the project is so amorphous potentially quite harmful, even if the funders are on board with lots of uncertainty. 
  • When we started, I didn’t think I could put more than a few hours in per week, even if I had the time free, so I’m working more or less my regular freelancing hours and am not cash-constrained. 
  • The combination of my not being locally cash-constrained, money not speeding me up, and the high risk of corrupting my epistemics, makes me not want to accept money at this stage. But I would still like to get paid for the work eventually.
  • Jasen is more cash-constrained and is giving up hours at his regular work in order to further the project, so it would be very beneficial for him to get paid.
  • Jasen is much more resistant to epistemic pressure than I am, although still averse to making commitments about outcomes at this stage.

Why Not Impact Certificates?

Impact certificates have been discussed within Effective Altruism for several years, first by Paul Christiano and Katja Grace, who pitched them as “accepting money to metaphorically erase your impact”. Ben Hoffman made a really valuable addition by framing impact certificates as selling funder credits, rather than all of the credit. There is currently a project attempting to get impact certificates off the ground, but it’s aimed at people outside funding trust networks doing very defined work, which is basically the opposite of my problem. 

What my co-founder and I needed is something more like startup equity, where you are given a percentage credit for the project, and that percentage can be sold later, and the price is expected to change as the project bears fruit or fails to do so. If six months from now someone thinks my work is super valuable they are welcome to pay us, but we have not obligated ourselves to a particular person to produce a particular result.

Completely separate from this, I have always found the startup practice of denominating stock grants in “% of company” (distributing all the equity at the beginning, having it vest over time, and being able to dilute it at any time) kind of bullshit. What I consider more honest is distributing shares as you go, with everyone recognizing that they don’t know what the total number of shares will be. This still provides a clean metric for comparing yourself to others and arguing about relative contributions, without any of the shadiness around percentages. It is mathematically identical to the standard system, but I find the legibility preferable. 

The System

In Short

  • Every week Jasen and I accrue n impact shares in the project (“impact shares” is better than the first name we came up with, but probably a better name is out there). n is currently 50 because 100 is a very round number. 1000 felt too big and 10 made anything we gave to anyone else feel too small. This is entirely a sop to human psychology; mathematically it makes no difference.
  • Our advisor/first customer accrues a much smaller number, less than 1 per week, although we are still figuring out the exact number. 
  • Future funders will also receive impact shares, although this is an even more theoretical exercise than the rest of it because we don’t expect them to care about our system or negotiate on it. Funding going to just one of us comes out of that person’s share, funding going to both of us or the project at large, probably gets issued new shares. 
  • Future employees can negotiate payment in money and impact shares as they choose.
  • In the unlikely event we take on a co-founder level collaborator in the future, probably they will accrue impact shares at the same rate we do but will not get retroactive shares. 


Founder Shares

One issue we had to deal with was that Jasen would benefit from a salary right away, while I found a salary actively harmful, but wouldn’t mind having funding for expenses (this is not logical but it wasn’t worth the effort to fight it). We have decided that funding that pays a salary is paid for with impact shares of the person receiving the salary, but funding for project expenses will be paid for either evenly out of both of our share pools, or with new impact shares. 

We are allowed to have our impact shares go negative, so we can log salary payments in a lump sum, rather than having to deal with it each week.

Initially, we weren’t sure how we should split impact shares between the two of us. Eventually, we decided to fall back on the YCombinator advice that uneven splits between cofounders are always more trouble than they’re worth. But before then we did some thought experiments about what the project would look like with only one of us. I had initially wanted to give him more shares because he was putting in more time than me, but the thought experiments convinced us both that I was more counterfactually crucial, and we agreed on 60/40 in my favor before reverting to a YC even split at my suggestion. 

My additional value came primarily from being more practical/applied. Applied work without theory is more useful than theory without application, so that’s one point for me. Additionally, all the value comes from convincing people to use our suggestions, and I’m the one with the reputation and connections to do that. That’s partly because I’m more applied, but also because I’ve spent a long time working in public, while Jasen had to be coaxed to allow his name on this document at all. I also know and am trusted by more funders, but I feel gross including that in the equation, especially when working with a close friend. 

We both felt that exercise was very useful and grounding in assessing the project, even though we ultimately didn’t use its results. Jasen and I are very close friends, and the relationship could handle measuring credit like that. I imagine many couldn’t, although that seems like a bad sign for the partnership overall. Or maybe we’re both too willing to give credit to other people, and that’s easier to solve than wanting too much for ourselves. What I recommend is to do the exercise and, unless you discover something really weird, still split credit evenly, but that feels like a concession to practicality that humanity will hopefully overcome. 

We initially discussed being able to give each other impact shares for particular pieces of work (one blog post, one insight, one meeting, etc). Eventually, we decided this was a terrible idea. It’s easy to picture how we might have the same assessment of each other’s overall or average contribution but still vary widely in how we assess an individual contribution. For me, Jasen thinking one thing was 50% more valuable than I thought it was did not feel good enough to make up for how bad it would be for him to think another contribution was half as valuable as I thought it was. For Jasen it was even worse, because having his work overestimated felt almost as bad as having it underestimated. Plus it’s just a lot of friction and assessment of idea seeds, when the whole point of this funding system is getting to wait and see how things turn out. So we agreed we would do occasional reassessments with months in between them (and of course we’re giving each other feedback constantly), but not do quantified assessments at smaller intervals.

Neither of us wanted to track the hours we were putting into the project; that just seemed very annoying. 

So ultimately we decided to give ourselves the same number of impact shares each week, with the ability to retroactively gift shares or negotiate for a change in distribution going forward, but those should be spaced out by months at a minimum. 

Funding Shares

When we receive funding, we credit the funder with impact shares. This works roughly like startup equity: assess how valuable the project is now, divide that by the number of outstanding shares, and that gets you a price per share. So if the project is currently valued at $10,000 and we have 100 shares outstanding, the price per share is $100, and one of us would have to give up 1 share to get $100 of funding.
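In code form, with the example numbers from this paragraph (a trivial sketch, not anything we actually run):

```python
def price_per_share(project_value, shares_outstanding):
    # Startup-equity-style pricing: value the whole project, then
    # divide by the number of impact shares that exist so far.
    return project_value / shares_outstanding

def shares_for_funding(amount, project_value, shares_outstanding):
    # How many impact shares a funder is credited with for `amount` of funding.
    return amount / price_per_share(project_value, shares_outstanding)

shares_for_funding(100, 10_000, 100)  # -> 1.0: $100 buys 1 of 100 shares
```

Because new shares accrue every week, the same dollar amount buys fewer shares as the denominator grows, which is the honest version of dilution described earlier.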

Of course, startup equity works because the investors are making informed estimates of the value of the startup. We don’t expect initial funders to be very interested in that process with us, so probably we’ll be assessing ourselves on the honor system, maybe polling some other people. This is a pretty big flaw in the plan, but I think overall still a step forward in developing the coordination tech. 

In addition to the lack of outside evaluation, the equity system misses the concept of funder’s credit from Ben Hoffman’s blog post which I think is otherwise very valuable.  Ultimately we decided that impact shares are no worse than the current startup equity model, and that works pretty well. “No worse than startup equity” was a theme in much of our decision-making around this system. 

Advisor Shares

We are still figuring out how many impact shares to give our advisor/first customer. YC has standard advice for this (0.25%-1%), but YC’s advice assumes you will be diluting shares later, so the number is not directly applicable. Advisor mostly doesn’t care right now, because he doesn’t feel that this is taking much effort from him. 

It was very important to Jasen to give credit to people who got him to the starting line of this project, even if they were not directly involved in it. Recognizing them by giving them some of his impact shares felt really good to him, way more tangible than thanking mom after spiking the ball in the end zone.


This is extremely experimental. I expect both the conventions around this to improve over time and for Jasen and me to improve our own model as we work. Some of that improvement will come from sharing our current ideas and hearing the responses, and I didn’t want to wait to start that conversation, so here we are. 

Thanks to several people, especially Austin Chen and Raymond Arnold, for discussion on this topic.

Cognitive Risks of Adolescent Binge Drinking

The takeaway

Our goal was to quantify the cognitive risks of heavy but not abusive alcohol consumption. This is an inherently difficult task: the world is noisy, humans are highly variable, and institutional review boards won’t let us do challenge trials of known poisons. This makes strong inference or quantification of small risks incredibly difficult. We know for a fact that enough alcohol can damage you, and even levels that aren’t inherently dangerous can cause dumb decisions with long-term consequences. All that said… when we tried to quantify the level of cognitive damage caused by college-level binge drinking, we couldn’t demonstrate an effect. This doesn’t mean there isn’t one (if nothing else, “here, hold my beer” moments are real), just that it is below the threshold detectable with current methods and levels of variation in the population.


In discussions with recent college graduates I (Elizabeth) casually mentioned that alcohol is obviously damaging to cognition. They were shocked and dismayed to find their friends were poisoning themselves, and wanted the costs quantified so they could reason with them (I hang around a very specific set of college students). Martin Bernstorff and I set out to research this together. Ultimately, 90-95% of the research was done by him, with me mostly contributing strategic guidance and somewhere between editing and co-writing this post. 

I spent an hour getting DALL-E to draw this

Problems with research on drinking during adolescence

Literature on the causal medium- to long-term effects of non-alcoholism-level drinking on cognition is, to our strong surprise, extremely lacking. This isn’t just our poor research skills; in 2019, the Danish Ministry of Health attempted a comprehensive review and concluded that:

“We actually know relatively little about which specific biological consequences a high level of alcohol intake during adolescence will have on youth”.

And it isn’t because scientists are ignoring the problem either. Studying medium- and long-term effects on brain development is difficult because of the myriad confounders and/or colliders for both cognition and alcohol consumption, and because more mechanistic experiments would be very difficult and are institutionally forbidden anyway (“Dear IRB: we would like to violently poison some teenagers for four years, while forbidding the other half to engage in standard college socialization”). You could randomize abstinence, but we’ll get back to that.

One problem highly prevalent in the alcohol literature is abstinence bias. People who abstain from alcohol are likely to do so for a reason, for example chronic disease, high conscientiousness or religiosity, or a bad family history with alcohol. Even if you factor out all of the known confounders, it’s still vanishingly unlikely the drinking and non-drinking samples are identical. Whatever the differences, they’re likely to affect cognitive (and other) outcomes. 

Any analysis comparing “no drinking” to “drinking” will suffer from this by estimating the effect of no alcohol + confounders, rather than the effect of alcohol. Unfortunately, this rules out a surprising number of studies (code available upon request). 

Confounding is possible to mitigate if we have accurate intuitions about the causal network and can estimate the effects of confounders accurately. We have to draw a directed acyclic graph with the relevant causal factors and adjust analyses or design accordingly. This is essential, but has not permeated all of epidemiology (yet), and especially in older literature it is not done. For a primer, Martin recommends the edX course “Draw Your Assumptions”.

Additionally, alcohol consumption is a politically live topic, and papers are likely to be biased. Which direction is a coin flip: public health wants to make it seem scarier, alcohol companies want to make it seem safer. Unfortunately, these biases don’t cancel out, they just obfuscate everything.

What can we do when we know much of the literature is likely biased, but we do not have a strong idea about the size or direction?


If we aggregate multiple estimates that are wrong, but in different (and overall uncorrelated) directions, we will approximate the true effect. For health, we have a few dimensions that we can vary over: observational/interventional, age, and species.
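As a toy illustration of why triangulation across lines of evidence helps (all numbers here are invented for the simulation; `bias_sd` and friends are assumptions, not claims about alcohol research):

```python
import random

def triangulate(true_effect=0.0, n_lines=5, bias_sd=1.0, noise_sd=0.3,
                n_sims=2000, seed=1):
    """Toy model: each line of evidence (observational, RCT, animal, ...)
    carries its own bias, drawn independently. Compare the error of a
    single line against the error of the pooled average."""
    rng = random.Random(seed)
    err_single, err_pooled = [], []
    for _ in range(n_sims):
        biases = [rng.gauss(0, bias_sd) for _ in range(n_lines)]
        ests = [true_effect + b + rng.gauss(0, noise_sd) for b in biases]
        err_single.append(abs(ests[0] - true_effect))
        err_pooled.append(abs(sum(ests) / n_lines - true_effect))
    mean = lambda xs: sum(xs) / len(xs)
    return mean(err_single), mean(err_pooled)

single_err, pooled_err = triangulate()
# pooled_err comes out well below single_err
```

The catch, as noted above, is that averaging only helps if the biases really are uncorrelated; if every line of evidence shares the same bias, pooling just averages the bias back in.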

Randomized abstinence studies

Ideally, we would have strong evidence from randomized controlled trials of abstinence. In experimental studies like this, there is no doubt about the direction of causality. And, since participants are randomized, confounders are evenly distributed between intervention and control groups. This means that our estimate of the intervention effect is unbiased by confounders, both measured and unmeasured.

However, we were only able to find two such studies, both from the 80s, among light drinkers (mean 3 standard units per week), and with durations of only 2-6 weeks (Birnbaum et al., 1983; Hannon et al., 1987).

Birnbaum et al. did not stick to the randomisation when analyzing their data, opening the door to confounding, which should decrease our confidence in their study. They found no effect of abstinence on their 7 cognitive measures.

In Hannon et al., instruction to abstain vs. maintain resulted in a difference in alcohol intake of 12.5 units per week over 2 weeks. On the WAIS-R vocabulary test, abstaining women scored 55.5 ± 6.7 and maintaining women scored 51.0 ± 8.8 (both mean ± SD). On the 3 other cognitive tests performed, they found no difference.

Especially due to the short duration, we should be very wary of extrapolating too much from these studies. However, it appears that for moderate amounts of drinking over a short time period, total abstinence does not provide a meaningful benefit in the above studies.

Observational studies on humans

Due to their observational nature (as opposed to being an experiment), these studies are extremely vulnerable to confounders, colliders, reverse causality etc. However, they are relatively cheap ways of getting information, and are performed in naturalistic settings.

One meta-analysis (Neafsey & Collins, 2011) compared moderate social drinking (< 4 drinks/day) to non-drinkers (note: the definition of moderate varies a lot between studies). They partially compensated for the abstinence bias by excluding “former drinkers” from their reference group, i.e. removing people who’ve stopped drinking for medical (or other) reasons. This should provide a less biased estimate of the true effect. They found a protective effect of social drinking on a composite endpoint, “cognitive decline/dementia” (Odds Ratio 0.79 [0.75; 0.84]).

Interestingly, they also found that studies adjusting for age, education, sex and smoking status did not have markedly different estimates from those that did not (OR 0.75 adjusted vs. 0.79 unadjusted). This should decrease our worry about confounding overall.

Observational studies on alcohol for infants

Another angle for triangulation is the effect of moderate maternal alcohol intake during pregnancy on the offspring’s IQ. The brain is never more vulnerable than during fetal development. There are obviously large differences between fetal and adolescent brains, so any generalization should be accompanied with large error bars. However, this might give us an upper bound.

Zuccolo et al. (2013) performed an elegant example of what’s called Mendelian randomization.

A SNP variant in a gene (ADH1B) is associated with decreased alcohol consumption. Since SNPs are near-randomly assigned (but see the examination of assumptions below), one can interpret this as the SNP causing decreased alcohol consumption. If some assumptions are met, that’s essentially a randomized controlled trial! Alas, these assumptions are extremely strong and unlikely to be totally true – but it can still be much better than merely comparing two groups with differing alcohol consumption.

As the authors very explicitly state, this analysis assumes that:

1. The SNP variant (rs1229984) decreases maternal alcohol consumption. This is confirmed in the data. Unfortunately, the authors do this by chi-square test (“does this alter consumption at all?”) rather than estimating the effect size. However, we can do our own calculations using Table 5:

If we round each alcohol consumption category to the mean of its bounds (0, 0.5, 3.5, 9), we get a mean intake in the SNP variant group of 0.55 units/week and a mean intake in the non-carrier group of 0.88 units/week (math). This means that SNP-carrier mothers drink, on average, 0.33 units/week less. That’s a pretty small difference! We would’ve liked the authors to do this calculation themselves, and use it to report IQ difference per unit of alcohol per week.
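The weighted-mean calculation looks like this. The category midpoints (0, 0.5, 3.5, 9) are the ones from the text above, but the example proportions are made up for illustration; the real ones are in the paper's Table 5:

```python
# Midpoints of the paper's alcohol-consumption categories, rounded to
# the mean of each category's bounds as described above (units/week).
MIDPOINTS = [0, 0.5, 3.5, 9]

def mean_intake(proportions, midpoints=MIDPOINTS):
    """Weighted mean intake given the share of mothers in each category."""
    assert abs(sum(proportions) - 1) < 1e-9
    return sum(p * m for p, m in zip(proportions, midpoints))

# Hypothetical distribution: 80% abstainers, 15% light, 4% moderate, 1% heavy.
mean_intake([0.80, 0.15, 0.04, 0.01])  # -> 0.305 units/week
```

Running this once with the carrier proportions and once with the non-carrier proportions gives the 0.55 vs. 0.88 units/week figures quoted above.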

2. There is no association between the genotype and confounding factors, including other genes. This assumption is satisfied for all factors examined in the study, like maternal age, parity, education, smoking in 1st trimester etc. (Table 4), but unmeasured confounding is totally a thing! E.g. a SNP which correlates with the current variant and causes a change in the offspring’s IQ/KS2-score.

3. The genotype does not affect the outcome by any path other than maternal alcohol consumption, for example through affecting metabolism of alcohol.

If we believe these assumptions to be true, the authors are estimating the effect of 0.33 maternal alcohol units per week on the offspring’s IQ and KS2-score. KS2-score is a test of intellectual achievement (similar to the SAT) for 11-year-olds with a mean of 100 points and a standard deviation of ~15 points. 

They find that the 0.33 unit/week decrease does not affect IQ (mean difference -0.01 [-2.8; 2.7]) and causes a 1.7 point (with a 95% confidence interval of between 0.4 and 3.0) increase in KS2 score. 

This is extremely interesting. Additionally, the authors complete a classical epidemiological study, adjusting for typical confounders.

Their Table 1 shows that the children of pre-pregnancy heavy drinkers, on average, scored 8.62 (SE 1.12) points higher in IQ than those of non-drinkers, 2.99 points (SE 1.06) after adjusting for confounders. However, they didn’t adjust for alcohol intake in other parts of the pregnancy! Puzzlingly, first-trimester drinking has an effect in the opposite direction: -3.14 points (SE 1.64) on IQ. However, this was also not adjusted for previous alcohol intake. This means that the estimates in Table 1 (pre-pregnancy and first trimester) aren’t independent, but we don’t know how they’re correlated. Good luck teasing out the causal effect of maternal alcohol intake and timing from that.

Either way, the authors (and I) interpret the effects as being highly confounded; either residual (the confounder was measured with insufficient accuracy for complete adjustment) or unknown (confounders that weren’t measured). For example, pre-pregnancy alcohol intake was strongly associated with professional social class and education (upper-class wine-drinkers?), whereas the opposite was true for first trimester alcohol intake. Perhaps drinking while you know you’re pregnant is low social status?

If you’re like Elizabeth, you’re probably surprised that drinking increases with social class. We didn’t dig into this deeply, but a quick search found that it does appear to hold up.

This result conflicts with that of the Mendelian randomization, but that makes sense. Mendelian randomization is less sensitive to confounding, so maybe there is no true effect. Also, the Mendelian analysis only estimated the effect of a 0.33 units/week difference, so it is probably underpowered. 

Taken together, the study should probably update us towards a lack of harm from moderate (whatever that means) levels of alcohol intake, although how big an update that is depends on your previous position. We say “moderate” because fetal alcohol syndrome is definitely a thing, so at sufficient alcohol intake it’s obviously harmful!


Experimental studies on rodents

There is a decently sized, pretty well-conducted literature on adolescent intermittent ethanol exposure (science speak for “binge drinking on the weekend”). Rat adolescence is somewhat similar to human adolescence; it’s marked by sexual maturation, increased risk-taking and increased social play (Sengupta, 2013). The following is largely based on a deeper dive into the linked references from (Seemiller & Gould, 2020).

Adolescent intermittent ethanol exposure is typically operationalised as dosing to a blood-alcohol concentration equivalent to ~10 standard alcohol units, 0.5-3 times/day, every 1-2 days during adolescence.

To interpret this, we make some big assumptions. Namely:

  1. Rodent blood-alcohol content can be translated 1:1 to humans
  2. Effects on rodent cognition at a given alcohol concentration are similar to those on human cognition 
  3. Rodent adolescence can mimic human adolescence

Now, let’s dive in!

Two primary tasks are used in the literature:

The 5-choice serial reaction time task. 

Rodents are placed in a small box, and one of 5 holes is lit up. They are measured on how quickly and reliably they respond by poking the lit hole. 

Training in the 5-CSRTT varies between studies, but the two studies below consist of 6 training sessions at age 60 days. Initially, rats were rewarded with pellets from the feeder in the box to alert them to the possibility of reward. 

Afterwards, training sessions had gradually increasing difficulty. To begin with, the light stays on for 30 seconds, but the duration gradually decreases to 1 second. Rats progressed to the next training schedule based on any of 3 predefined criteria: 100 trials completed, >80% accuracy, or <20% omissions. 

Naturally, you can measure a ton of stuff here! Generally, focus is on accuracy and omissions, but there are a ton of others:

From (Boutros et al., 2017) sup. table 1, congruent with (Semenova, 2012)

Now we know how they measured performance; but how did they imitate adolescent drinking?

Boutros et al. administered 5 g/kg of 25% ethanol through the mouth once per day in a 2-day on/off pattern, from age 28 days to 57 days – a total of 14 administrations. Based on blood alcohol content, this is equivalent to 10 standard units at each administration – quite a dose! Surprisingly, they found a decrease in omissions with the standard task, but no other systematic changes, in spite of 50+ analyses on variations of the measures (accuracy, omissions, correct responses, incorrect responses etc.) and task difficulty (length of the light staying on, whether they got the rats drunk etc.). We’d chalk this up to a chance finding.

Semenova used the same training schedule, but administered 5 g/kg of 25% ethanol through the mouth every 8h for 4 days – a total of 12 administrations. The study found small differences in different directions on different measures, with the same multiple-comparisons problem as above. Looks like noise to us.

The Barnes Maze 

Rodents are placed in the middle of an approximately 1m circle with 20-40 holes at the perimeter and are timed on how quickly they arrive at the hole with a reward (and escape box) below it. For timing spatial learning, the location of the hole is held constant. In (Coleman et al., 2014) and (Vetreno & Crews, 2012), rodents were timed once a day for 5 days. They were then given 4 days of rest, and the escape hole was relocated exactly 180° from the initial location. They were then timed again once a day, measuring relearning.

Figure: Tracing of the route taken by a control mouse right after the location was reversed, from Coleman et al., 2014.

Both studies found no effect of adolescent intermittent ethanol exposure on initial learning rate or errors. 

Vetreno found alcohol-exposed rats took longer to escape on their first trial but did equally well in all subsequent trials.

Whereas Coleman found a ~3x difference in performance on the relearning task, with similar half-times.

Somewhat suspiciously, even though Vetreno et al. was published 2 years after Coleman et al. and they share the same lab, they do not reference Coleman et al.

This does, technically, show an effect. However, given the small effect size, the number of metrics measured, file-drawer effects, and the disagreement with the rest of the literature, we believe this is best treated as a null result.


So, what should we do? From the epidemiological literature, if you care about dementia risk, it looks like social drinking (i.e. excluding alcoholics) reduces your risk by ~20% compared to not drinking. All other effects were part of a heterogeneous literature with small effect sizes on cognition. Taken together, the long-term cognitive effects of conventional alcohol intake during adolescence should play only a minor role in decisions about alcohol intake.

Thanks to an FTX Future Fund regrantor for funding this work.

Birnbaum, I. M., Taylor, T. H., & Parker, E. S. (1983). Alcohol and Sober Mood State in Female Social Drinkers. Alcoholism: Clinical and Experimental Research, 7(4), 362–368.

Boutros, N., Der-Avakian, A., Markou, A., & Semenova, S. (2017). Effects of early life stress and adolescent ethanol exposure on adult cognitive performance in the 5-choice serial reaction time task in Wistar male rats. Psychopharmacology, 234(9), 1549–1556.

Coleman, L. G., Liu, W., Oguz, I., Styner, M., & Crews, F. T. (2014). Adolescent binge ethanol treatment alters adult brain regional volumes, cortical extracellular matrix protein and behavioral flexibility. Pharmacology Biochemistry and Behavior, 116, 142–151.

Hannon, R., Butler, C. P., Day, C. L., Khan, S. A., Quitoriano, L. A., Butler, A. M., & Meredith, L. A. (1987). Social drinking and cognitive functioning in college students: A replication and reversibility study. Journal of Studies on Alcohol, 48(5), 502–506.

Neafsey, E. J., & Collins, M. A. (2011). Moderate alcohol consumption and cognitive risk. Neuropsychiatric Disease and Treatment, 7, 465–484.

Seemiller, L. R., & Gould, T. J. (2020). The effects of adolescent alcohol exposure on learning and related neurobiology in humans and rodents. Neurobiology of Learning and Memory, 172, 107234.

Semenova, S. (2012). Attention, impulsivity, and cognitive flexibility in adult male rats exposed to ethanol binge during adolescence as measured in the five-choice serial reaction time task: The effects of task and ethanol challenges. Psychopharmacology, 219(2), 433–442.

Sengupta, P. (2013). The Laboratory Rat: Relating Its Age With Human’s. International Journal of Preventive Medicine, 4(6), 624–630.

Vetreno, R. P., & Crews, F. T. (2012). Adolescent binge drinking increases expression of the danger signal receptor agonist HMGB1 and toll-like receptors in the adult prefrontal cortex. Neuroscience, 226, 475–488.

Zuccolo, L., Lewis, S. J., Davey Smith, G., Sayal, K., Draper, E. S., Fraser, R., Barrow, M., Alati, R., Ring, S., Macleod, J., Golding, J., Heron, J., & Gray, R. (2013). Prenatal alcohol exposure and offspring cognition and school performance. A ‘Mendelian randomization’ natural experiment. International Journal of Epidemiology, 42(5), 1358–1370.

Quick Look: Asymptomatic Herpes Shedding

Tl;dr: Individuals shed and thus probably spread oral HSV1 while completely asymptomatic.


“Herpes virus” can refer to several viruses in the herpes family, including chickenpox and Epstein-Barr (which causes mono). All herpesviridae infections are for life: once infected, the virus will curl up in its cell of choice, possibly to leap out and begin reproduction again later. If the virus produces visible symptoms, it is called symptomatic. If the virus is producing viable virions that can infect other people, it’s called shedding. How correlated symptoms and shedding are is the topic of this post. 

When people say “herpes” without further specification, they typically mean herpes simplex 1 or 2. HSV1 and 2 are both permanent infections of nerve cells that can lay dormant forever, or intermittently cause painful blisters on mucous membranes (typically mouth or genitals, occasionally eyes, very occasionally elsewhere). There are also concerns about subtle long-term effects, which I do not go into here.

There are two competing pieces of conventional wisdom on HSV: “you can shed infectious virus at any time, even without a sore; most people who catch herpes catch it from an asymptomatic individual” and “99.9% of shedding occurs during or right before a blister, and there are distinct signs you can recognize if you’re paying attention; if you can recognize an oncoming blister the chances of infecting another human are negligible.” At the request of a client I performed two hours of research to judge between these.

It is definitely true that doctors will only run tests looking for the virus directly (as opposed to antibodies) if you have an active sore. However, when researchers proactively sampled asymptomatic individuals using either genetic tests (PCR/NAAT, which look for viral DNA in a sample) or viral culture (which attempts to grow virus from your sample in a petri dish), they reliably found some people shedding virus. 

HSV1 prefers the mouth but is well known to infect genitals as well. HSV2 is almost exclusively genital. Due to a dearth of studies I’ve included some HSV2 and genital HSV1 studies. 


Tronstein et al: This paper stupidly lumped in “0% shedding” with “>0% shedding” and I hate them. Ignoring that, they found that 10% of all days recorded from individuals with asymptomatic genital HSV2 involved shedding, and these were distributed on a long tail, with the peak at 0-5%. I cannot tell if they lumped 0% and 0.1% together because 0% never happens, or because they hate science. 

your buckets are bad and you should feel bad

Bowman et al: 14% of previously symptomatic genital HSV2 patients shed isolate-able virus (sampled every 8 weeks over ~3 years) while on antivirals. This study reports “isolating” virus without further details; I expect this means viral culture. 

Sacks et al, citing another paper: shedding across 6% of days in oral HSV1 patients (using viral culture). It also found the following asymptomatic shedding rates for genital herpes:

Spruance: oral HSV1 patients shed isolatable virus 7.4% of the time (including while symptomatic). 60% of this occurred while experiencing mild symptoms that could have indicated an upcoming sore, but never developed into a sore.

Tateish et al: tested 1000 samples from oral surgery patients (not filtered for HSV infection status). 4.7% had PCR-detectable herpes DNA, and 2.7% had culturable virus. This includes patients without herpes (about 50% of people in Japan, where the research was done), but oral surgery is stressful and often stems from issues that make it easier to shed herpes, so I consider those to ~cancel out. 


My conclusion: it is definitely possible to shed HSV while asymptomatic, including if you are never symptomatic. The daily shedding rate is something like 3-12%, although with lots of interpersonal variability. This doesn’t translate directly to an infectiousness rate: human mouths might be harder or easier to infect than petri dishes (my guess is harder, based on the continued existence of serodiscordant couples). It may be possible for people who are antibody positive for HSV to never shed virus but we don’t know because no one ran the right tests. 

Thanks to anonymous client for funding the initial research and my Patreon patrons for supporting the public write-up.

New Water Quality x Obesity Dataset Available

Tl;dr: I created a dataset of US counties’ water contamination and obesity levels. So far I have failed to find anything really interesting with it, but maybe you will. If you are interested you can download the dataset here. Be warned every spreadsheet program will choke on it; you definitely need to use statistical programming.

Photocredit: DALL-E and a lot of coaxing 

Many of you have read Slime Mold Time Mold’s series on the hypothesis that environmental contaminants are driving weight gain. I haven’t done a deep dive on their work, but their lit review is certainly suggestive. 

SMTM did some original analysis by looking at obesity levels by state, but this is pretty hopeless. They’re using average altitude by state as a proxy for water purity for the entire state, and then correlating that with the state’s % resident obesity. A water source’s contamination does seem negatively correlated with its altitude, and a source’s altitude is correlated with its end-users’ altitudes, and an end-user’s altitude is correlated with their state’s average altitude… but I think that’s too many steps removed with too much noise at each step. So the aggregation by state is basically meaningless, except for showing us Colorado is weird.

So I dug up a better data set, which had contamination levels for almost every water system in the country, accessible by zip code, and another one that had obesity prevalence by county. I combined these into a single spreadsheet and did some very basic statistical analysis on them to look for correlations.
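For the curious, the zip-up looks roughly like this. This is my own minimal pandas sketch with made-up toy data, not the actual processing code; the real datasets’ file layouts and column names differ, and a real zip→county crosswalk is messier (some zips straddle counties).

```python
# Sketch of joining contamination-by-zip to obesity-by-county.
# All names and values here are illustrative, not from the real datasets.
import pandas as pd

# Contamination readings keyed by zip code (one row per reading)
contamination = pd.DataFrame({
    "zip": ["94110", "94110", "80302"],
    "contaminant": ["TTHMs", "Barium (total)", "TTHMs"],
    "level": [12.0, 0.8, 3.1],
})

# Obesity prevalence keyed by county
obesity = pd.DataFrame({
    "county": ["San Francisco, CA", "Boulder, CO"],
    "pct_obese": [18.0, 15.0],
})

# Zip -> county crosswalk (in reality some zips map to multiple counties)
crosswalk = pd.DataFrame({
    "zip": ["94110", "80302"],
    "county": ["San Francisco, CA", "Boulder, CO"],
})

# Two inner joins: attach a county to each reading, then attach obesity
merged = contamination.merge(crosswalk, on="zip").merge(obesity, on="county")
print(merged[["zip", "contaminant", "level", "pct_obese"]])
```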

Some caveats before we start:

  • The dataset looks reasonable to me, but I haven’t examined it exhaustively and don’t know where the holes are. 
  • Slime Mold Time Mold’s top contender for an environmental contaminant is lithium. While technically present in the database, lithium had five entries so I ignored it. I haven’t investigated but my guess is no one tests for lithium.
  • It’s rare, but some zip codes have multiple water suppliers, and the spreadsheet treats them as two separate entities that coincidentally have the same obesity prevalence.
  • I’ve made no attempt to back out basic confounding variables like income or age.
  • “% obese” is a much worse metric than average BMI, which is itself a much worse metric than % body fat. 
  • None of those metrics would catch if a contaminant makes some people very fat while making others thin (SMTM thinks paradoxical effects are a big deal, so this is a major gap for testing their model).
  • Correlation still does not equal causation.

The correlations (for contaminants with >10k entries):

Contaminant                        Correlation    # Samples
Total haloacetic acids (HAAs)            0.055        14666
Barium (total)                           0.040        17929
Total trihalomethanes (TTHMs)            0.117        21184
Nitrate & nitrite                        0.035        11902
Lead (total)                            -0.006        13031
Dichloroacetic acid                     -0.003        10159

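If you want to reproduce this kind of table from the merged data, the per-contaminant correlation loop is simple. A sketch with toy data; the column names are my guesses, and the real analysis ran over tens of thousands of rows per contaminant:

```python
import pandas as pd

# Toy stand-in for the merged dataset (contaminant level vs. county obesity)
df = pd.DataFrame({
    "contaminant": ["TTHMs"] * 4 + ["Lead (total)"] * 4,
    "level": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "pct_obese": [20.0, 21.0, 23.0, 24.0, 22.0, 20.0, 23.0, 21.0],
})

MIN_SAMPLES = 4  # the post only reported contaminants with >10k entries

for name, group in df.groupby("contaminant"):
    if len(group) >= MIN_SAMPLES:
        # Pearson correlation between contaminant level and % obese
        r = group["level"].corr(group["pct_obese"])
        print(f"{name}: r={r:.3f}, n={len(group)}")
```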
Of these, the only one that looks interesting is trihalomethanes, a chemical group that includes chloroform. Here’s the graph:

Visually this looks like the floor is rising much faster than the ceiling, but in a conversation on twitter SMTM suggested that’s an artifact of the bivariate distribution; it disappears if you look at the log-normal view. 

Very casual googling suggests that TTHMs are definitely bad for pregnancy in sufficient quantities, and are maybe in a complicated relationship with Type 2 diabetes, but no slam dunks.

This is about as far as I’ve had time to get. My conclusions alas are not very actionable, but maybe someone else can do something interesting with the data.

Thanks to Austin Chen for zipping the two data sets together, Daniel Filan for doing additional data processing and statistical analysis, and my Patreon patrons for supporting this research.

Home Antigen Tests Aren’t Useful For Covid Screening

Epistemic status: I strongly believe this is the right conclusion given the available data. The best available data is not that good, and if better data comes out I reserve the right to change my opinion.

EDIT (4/27): In a development I consider deeply frustrating but probably ultimately good, the same office is now getting much more useful information from antigen tests. They aren’t tracking with the same rigor so I can’t compare results, but they are now beating the bar of “literally ever noticing covid”.

In an attempt to avoid covid without being miserable, many of my friends are hosting group events but requiring attendees to take a home covid test beforehand. Based on data from a medium-sized office, I believe antigen tests don’t work for covid screening: with the tests people are actually using, this is security theater that provides no decrease in risk. There is a more expensive home test available that provides some value, and rapid PCR may still be viable.

It’s important to distinguish between test types here: antigen tests look for viral proteins, and genetic amplification tests amplify viral RNA until it reaches detectable levels. The latter are much more sensitive. Most home tests are antigen tests, with the exception of Cue, which uses NAAT (a type of genetic amplification). An office in the bay area used aggressive testing with both Cue and antigen tests to control covid in the office and kept meticulous notes, which they were kind enough to share with me. Here are the aggregated numbers: 

  • The office requested daily Cue tests from workers. I don’t know how many people this ultimately included, probably low hundreds? I expect compliance was >95% but not perfect.
    • The results are from January when the dominant strain was Omicron classic, but no one got strain tested.
  • 39 people had at least one positive Cue test, all of whom were either asymptomatic or ambiguously symptomatic (e.g. symptoms could be explained by allergies) at the time, and 27 of whom had recent negative Cue tests (often but not always the day before, sometimes the same day)
  • Of these, 10 definitely went on to develop symptoms, 7 definitely did not, and 18 were ambiguous (and a few were missing data).
  • 33 people with positives were retested with Cue tests, of whom 9 were positive. 
  • Of those 24 who tested positive and then negative, 4 tested positive on their third or fourth test.
  • Of the 20 people with a single positive test followed by multiple negative retests, 6 went on to develop symptoms.
  • 0 people tested positive on antigen tests. There was not a single positive antigen test across this group. They not only didn’t catch covid as early as Cue did, they did not catch any cases at all, including at least 2 people who took the tests while experiencing definitive symptoms.
    • Antigen tests were a mix of Binax and QuickVue.
    • Early cases took multiple antigen tests over several days, later cases stopped bothering entirely.
    • The “negative test while symptomatic” count is artificially low because I excluded people with ambiguous symptoms, and because later infectees didn’t bother with antigen tests. 
    • I suppose I can’t rule out the possibility that they had an unrelated disease with similar symptoms and a false positive on the Cue test. But it seems unlikely that that happened 10-28 times out of a few hundred people without leaving other evidence.
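One back-of-envelope way to bound how bad the antigen tests were: with zero positives out of n true cases, the “rule of three” gives an approximate 95% upper confidence bound of 3/n on the test’s sensitivity. The right denominator here is murky (somewhere between the ~10 definite cases and the 39 Cue positives), so the values of n below are guesses, not figures from the office’s data:

```python
# "Rule of three": after 0 hits in n independent trials, an approximate
# 95% upper confidence bound on the underlying rate is 3/n.
def rule_of_three_upper_bound(n: int) -> float:
    """Approx. 95% upper bound on a rate after observing 0 successes in n trials."""
    return 3 / n

# Candidate denominators for "true cases who took antigen tests" (guesses)
for n in (10, 20, 39):
    print(f"n={n}: sensitivity < {rule_of_three_upper_bound(n):.0%} (approx. 95% bound)")
```

Even the most generous denominator leaves the plausible sensitivity well below what you’d want from a screening test.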

A common defense of antigen tests is that they detect whether you’re contagious at that moment, not whether you will eventually become contagious. Given the existence of people who tested antigen-negative while Cue-positive and symptomatic, I can’t take that seriously.

Unfortunately Cue tests are very expensive. You need a dedicated reader, which is $250, and tests are $65 each (some discount if you sign up for a subscription). A reader can only run 1 test at a time and each test takes 30 minutes, so you need a lot for large gatherings even if people stagger their entrances. 
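To make that cost concrete, here’s a sketch of the arithmetic for screening a gathering, using the prices above. The formula is my own back-of-envelope, assuming one test per guest and readers running back to back:

```python
import math

# Cue cost/throughput figures from the post
READER_COST = 250      # dollars per reader
TEST_COST = 65         # dollars per test cartridge
MINUTES_PER_TEST = 30  # one test at a time per reader

def cue_screening_cost(guests: int, arrival_window_minutes: int) -> tuple[int, int]:
    """Return (readers needed, total dollars) to test every guest on arrival."""
    tests_per_reader = max(arrival_window_minutes // MINUTES_PER_TEST, 1)
    readers = math.ceil(guests / tests_per_reader)
    return readers, readers * READER_COST + guests * TEST_COST

readers, cost = cue_screening_cost(guests=20, arrival_window_minutes=60)
print(f"{readers} readers, ${cost} total")  # 10 readers, $3800 for a 20-person party
```

A 20-person party with a one-hour arrival window needs 10 readers, and nearly $4000 all-in, which is why nobody does this.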

My contact’s best guess is that the aggressive testing reduced but did not eliminate in-office spread, but it’s hard to quantify because any given case could have been caught outside the office, and because they were trying so many interventions at once. Multiple people tested positive, took a second test right away, and got a negative result, some of whom went on to develop symptoms; we should probably assume the same chance of someone testing negative when a second test would have come back positive, and some of those would have been true positives. So even extremely aggressive testing has gaps.

Meanwhile, have I mentioned lately how good open windows and air purifiers are for covid? And other illnesses, and pollution? And that taping a HEPA filter to a box fan is a reasonable substitute for an air purifier achievable for a very small number of dollars? Have you changed your filter recently? 

PS. Before you throw your antigen tests out, note that they are more useful than Cue tests for determining if you’re over covid. Like PCR, NAAT can continue to pick up dead RNA for days, maybe weeks, after you have cleared the infection. A negative antigen test after symptoms have abated and there has been at least one positive test is still useful evidence to me. 

PPS. I went through some notes and back in September I estimated that antigen testing would catch 25-70% of presymptomatic covid cases. Omicron moves faster, maybe enough faster that 25% was only reasonable for delta; 70% looks obviously too high now. 

PPPS. Talked to another person at the office, their take is the Cue tests are oversensitive. I think this fits the data worse but feel obliged to pass it on since they were there and I wasn’t.

PPPPS (5/02): multiple people responded across platforms that they had gotten positive antigen tests. One or two of these were even presymptomatic. I acknowledge the existence proof but will not be updating until the data has a denominator. If you’re doing a large event like a conference I encourage you to give everyone Cue, antigen, and rapid PCR tests and record their results, and who eventually gets sick. If you’d like help designing this experiment in more detail please reach out (

I Caught Covid And All I Got Was This Lousy Ambiguous Data

Tl;dr I tried to run an n of 1 study on niacin and covid, and it failed to confirm or disprove anything at all.

You may remember that back in October I published a very long post investigating a niacin-based treatment protocol for long covid. My overall conclusion was “seems promising but not a slam dunk; I expect more rigorous investigation to show nothing but we should definitely check”. 

Well, recently I got covid and had run out of more productive things I was capable of doing, so I decided to test the niacin theory. I learned nothing, but null results are still results so I’m sharing anyway.

Background On Niacin

Niacin is a B-vitamin used in a ton of metabolic processes. If you’re really curious, I describe it in excruciating detail in the original post.

All B vitamins are water-soluble, and it is generally believed that unless you take unbelievably stupid doses you will pee out any excess intake without noticing. It’s much harder to build up stores of water-soluble vitamins than fat-soluble vitamins, so you need a more regular supply. Niacin is a little weird among the water-solubles in that it gives very obvious signs of overdose: called flush, the symptoms consist of itchy skin and feeling overheated. Large doses can lead to uncontrolled shaking, but why would you ever take that much, when it’s so easy to avoid?

People regularly report response patterns that sure look like their body has a store of niacin that can be depleted and refilled over time. A dose someone has been taking for weeks or months will suddenly start giving them flush, and if they don’t lower it the flush symptoms will get worse and worse. 

Some forms of niacin don’t produce flush. Open question if those offer the same benefits with no side effects, offer fewer benefits, or are completely useless.

Niacin And Long Covid

There’s an elaborate hypothesis about how covid depletes niacin (and downstream products), and this is a contributor to long covid. My full analysis is here. As of last year I hadn’t had covid (this is antibody test confirmed, I definitely didn’t have an asymptomatic case) but I did have lingering symptoms from my vaccine and not a lot else to try, so I gave the protocol a shot.

My experience was pretty consistent with the niacin-storage theory. I spent a long time at quite a high dose of the form of niacin the protocol recommends, nicotinic acid. My peak dose without flush was at least 250mg (1563% RDA) and maybe even 375mg (2345% RDA). When I hit my limit I lowered my dose until I started getting flush at the new dose, and eventually went off nicotinic acid entirely (although I restarted a B-vitamin that included 313% RDA of a different form). That ended in September or early October 2021. It made no difference in my lingering vaccine symptoms.

In early 2022 I tried nicotinic acid again. Even ¼ tablet (62.5mg, 390% RDA) gave me flush.
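For anyone checking my arithmetic: the percentages above assume the standard 16 mg adult RDA for niacin, which is the value my numbers imply (my rounding differs by a point here and there):

```python
# Convert a nicotinic acid dose to % RDA, assuming the 16 mg adult niacin RDA
RDA_MG = 16

def pct_rda(dose_mg: float) -> float:
    """Dose as a percentage of the assumed 16 mg RDA."""
    return dose_mg / RDA_MG * 100

for dose in (250, 375, 62.5):
    print(f"{dose} mg = {pct_rda(dose):.1f}% RDA")
```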

I Get Covid

Once I developed symptoms and had done all the more obviously useful things like getting Paxlovid, I decided it would be fun to test myself with niacin (and the rest of the supplement stack discussed in my post) and see if covid had any effect. So during my two weeks of illness and week of recovery I occasionally took nicotinic acid and recorded my results. Here’s the overall timeline:

  1. Day -2: am exposed to covid.
  2. Day 0: test positive on a Cue test (a home test that uses genetic amplification).
    1. Lung capacity test: 470 (over 400 is considered healthy).
    2. Start Fluvoxamine and the vitamin cocktail, although I’m inconsistent with both the new and existing vitamins during the worst of the illness. Vitamin cocktail includes 313% RDA of no-flush niacin, but not nicotinic acid. 
  3. Day 1: symptomatic AF. 102.3 degree fever, awake only long enough to pee, refill my water, and make sure my O2 saturation isn’t going to kill me. I eat nothing the entire day.
    1. I monitored my O2 throughout this adventure but it never went into a dangerous zone so I’m leaving it out of the rest of the story.
  4. Day 2: start with 99 degree fever, end day with no fever. Start Paxlovid.
    1. Every day after this I am awake a little bit longer, eat a little bit more, and have a little more cognitive energy, although it takes a while to get back to normal. 
    2. Try ¼ tab nicotinic acid (62.5 mg / 390% RDA), no flush.
    3. Lung capacity troughs at 350 (considered orange zone).
  5. Day 4: ½ tablet nicotinic acid, mild flush.
  6. Day 7: lung capacity up to 450, it will continue to vary from 430-450 for the next two weeks before occasionally going higher.
  7. Day 9: ½ tablet nicotinic acid, mild flush
  8. Day 10-17: ⅓ tablet nicotinic acid, no flush
    1. Where by “⅓” tablet I mean “I bit off an amount of pill that was definitely >¼ and <½ and probably averaged to ~⅓ over time”
  9. Day 12: I test positive on a home antigen test
  10. Day 15: I test negative on a home antigen test (no tests in between) 
  11. Day 17: ⅓ tablet produces flush (and a second negative antigen test)
    1. This was also the first day I left my house. I had thought of myself as still prone to fatigue but ended up having a lot of energy once I got out of my house and have been pretty okay since.


My case of covid was about as bad as you get while still technically counting as mild. Assuming I went into it with niacin stores such that 62.5mg nicotinic acid would generate flush, it looks like covid immediately took a small bite out of them. Or it reduced my absorption of vitamins, such that the same oral dosage resulted in less niacin being taken in. There’s no way to know whether covid had a larger effect on niacin than other illnesses, because I don’t have any to compare it to. Or maybe the whole thing was an artifact of “not eating for two days, and then only barely, and being inconsistent with my vitamins for a week”.