Impact Shares For Speculative Projects

Introduction

Recently I founded a new project with Jasen Murray, a close friend of several years. At founding the project was extremely amorphous (“preparadigmatic science: how does it work?”) and was going to exit that state slowly, if at all. This made it a bad fit for traditional “apply for a grant, receive money, do work” style funding. The obvious answer is impact certificates, but the current state of the art there wasn’t an easy fit either. In addition to the object-level project, I’m interested in advancing the social tech of funding. With that in mind, Jasen and I negotiated a new system for allocating credit and funding.

This system is extremely experimental, so we have chosen not to make it binding. If we decide to do something different in a few months or a few years, we do not consider ourselves to have broken any promises. 

In the interest of advancing the overall tech, I wanted to share the considerations we have thought of and tentative conclusions. 

[Image: DALL-E rendering of impact shares]

Considerations

All of the following made traditional grant-based funding a bad fit:

  • Our project is currently very speculative and its outcomes are poorly defined. I expect it to be still speculative but at least a little more defined in a few months.
  • I have something that could be called integrity and could be called scrupulosity issues, which makes me feel strongly bound to follow plans I have written down and people have paid me for, to the point it can corrupt my epistemics. This makes accepting money while the project is so amorphous potentially quite harmful, even if the funders are on board with lots of uncertainty. 
  • When we started, I didn’t think I could put more than a few hours in per week, even if I had the time free, so I’m working more or less my regular freelancing hours and am not cash-constrained. 
  • The combination of my not being locally cash-constrained, money not speeding me up, and the high risk of corrupting my epistemics, makes me not want to accept money at this stage. But I would still like to get paid for the work eventually.
  • Jasen is more cash-constrained and is giving up hours at his regular work in order to further the project, so it would be very beneficial for him to get paid.
  • Jasen is much more resistant to epistemic pressure than I am, although still averse to making commitments about outcomes at this stage.

Why Not Impact Certificates?

Impact certificates have been discussed within Effective Altruism for several years, first by Paul Christiano and Katja Grace, who pitched it as “accepting money to metaphorically erase your impact”. Ben Hoffman made a really valuable addition by framing impact certificates as selling funder credits, rather than all of the credit. There is currently a project attempting to get impact certificates off the ground, but it’s aimed at people outside funding trust networks doing very defined work, which is basically the opposite of my problem.

What my co-founder and I needed is something more like startup equity: you are given percentage credit for the project, that credit can be sold later, and the price is expected to change as the project bears fruit or fails to do so. If six months from now someone thinks my work is super valuable they are welcome to pay us, but we have not obligated ourselves to a particular person to produce a particular result.

Completely separate from this, I have always found the startup practice of denominating stock grants in “% of company”, distributing all the equity at the beginning but having it vest over time, and being able to dilute it at any time, kind of bullshit. What I consider more honest is distributing shares as you go, with everyone recognizing that they don’t know what the total number of shares will be. This still provides a clean metric for comparing yourself to others and arguing about relative contributions, without any of the shadiness around percentages. This is mathematically identical to the standard system, but I find the legibility preferable.
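
To make that equivalence concrete, here's a minimal sketch (my own illustration with made-up numbers, not anything from our actual ledger): however you frame it, ownership is always shares held divided by shares outstanding.

```python
# Minimal sketch (made-up numbers): ownership is shares held / shares outstanding,
# so "grant 50% then dilute" and "issue shares as you go" give identical answers.

def ownership(my_shares: float, total_shares: float) -> float:
    return my_shares / total_shares

# Standard framing: granted 50 of 100 shares ("50%"), then 100 more shares
# are issued to new contributors, diluting me.
print(ownership(50, 100 + 100))  # 0.25

# Issue-as-you-go framing: I accrue 50 shares while others accrue 150 in total;
# no one ever claimed a fixed percentage, but the math is identical.
print(ownership(50, 50 + 150))   # 0.25
```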

The System

In Short

  • Every week Jasen and I accrue n impact shares in the project (“impact shares” is better than the first name we came up with, but probably a better name is out there). n is currently 50 because 100 is a very round number. 1000 felt too big and 10 made anything we gave to anyone else feel too small. This is entirely a sop to human psychology; mathematically it makes no difference.
  • Our advisor/first customer accrues a much smaller number, less than 1 per week, although we are still figuring out the exact number. 
  • Future funders will also receive impact shares, although this is an even more theoretical exercise than the rest of it because we don’t expect them to care about our system or negotiate on it. Funding going to just one of us comes out of that person’s shares; funding going to both of us or the project at large probably gets issued new shares.
  • Future employees can negotiate payment in money and impact shares as they choose.
  • In the unlikely event we take on a co-founder level collaborator in the future, probably they will accrue impact shares at the same rate we do but will not get retroactive shares. 

Details

Founder Shares

One issue we had to deal with was that Jasen would benefit from a salary right away, while I found a salary actively harmful, but wouldn’t mind having funding for expenses (this is not logical but it wasn’t worth the effort to fight it). We have decided that funding that pays a salary is paid for with impact shares of the person receiving the salary, but funding for project expenses will be paid for either evenly out of both of our share pools, or with new impact shares.

We are allowed to have our impact shares go negative, so we can log salary payments in a lump sum, rather than having to deal with it each week.

Initially, we weren’t sure how we should split impact shares between the two of us. Eventually, we decided to fall back on the YCombinator advice that uneven splits between cofounders are always more trouble than they’re worth. But before then we did some thought experiments about what the project would look like with only one of us. I had initially wanted to give him more shares because he was putting in more time than me, but the thought experiments convinced us both that I was more counterfactually crucial, and we agreed on 60/40 in my favor before reverting to a YC even split at my suggestion.

My additional value came primarily from being more practical/applied. Applied work without theory is more useful than theory without application, so that’s one point for me. Additionally all the value comes from convincing people to use our suggestions, and I’m the one with the reputation and connections to do that. That’s in part because I’m more applied, but also because I’ve spent a long time working in public and Jasen had to be coaxed to allow his name on this document at all. I also know and am trusted by more funders, but I feel gross including that in the equation, especially when working with a close friend. 

We both felt like that exercise was very useful and grounding in assessing the project, even if we ultimately didn’t use its results. Jasen and I are very close friends and the relationship could handle the measuring of credit like that. I imagine many can’t, although it seems like a bad sign for a partnership overall. Or maybe we’re both too willing to give credit to other people and that’s easier to solve than wanting too much for ourselves. I think what I recommend is to do the exercise and unless you discover something really weird still split credit evenly, but that feels like a concession to practicality humanity will hopefully overcome. 

We initially discussed being able to give each other impact shares for particular pieces of work (one blog post, one insight, one meeting, etc). Eventually, we decided this was a terrible idea. It’s really easy to picture how we might have the same assessment of the other’s overall or average contribution but still vary widely in how we assess an individual contribution. For me, Jasen thinking one thing was 50% more valuable than I thought it was did not feel good enough to make up for how bad it would be for him to think another contribution was half as valuable as I thought it was. For Jasen it was even worse, because having his work overestimated felt almost as bad as having it underestimated. Plus it’s just a lot of friction and assessment of idea seeds when the whole point of this funding system is getting to wait to see how things turn out. So we agreed we would do occasional reassessments with months in between them, and of course we’re giving each other feedback constantly, but not to do quantified assessments at smaller intervals.

Neither of us wanted to track the hours we were putting into the project, that just seemed very annoying. 

So ultimately we decided to give ourselves the same number of impact shares each week, with the ability to retroactively gift shares or negotiate for a change in distribution going forward, but those should be spaced out by months at a minimum. 

Funding Shares

When we receive funding we credit the funder with impact shares. This will work roughly like startup equity: you assess how valuable the project is now, divide that by the number of outstanding shares, and that gets you a price per share. So if the project is currently valued at $10,000 and we have 100 shares outstanding, a funder would receive 1 share for every $100 they put in.
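
A minimal sketch of that pricing rule, using the example numbers above:

```python
# Sketch of the pricing rule with the example numbers from the text.
def price_per_share(project_valuation: float, shares_outstanding: int) -> float:
    return project_valuation / shares_outstanding

print(price_per_share(10_000, 100))  # 100.0 dollars per impact share
```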

Of course, startup equity works because the investors are making informed estimates of the value of the startup. We don’t expect initial funders to be very interested in that process with us, so probably we’ll be assessing ourselves on the honor system, maybe polling some other people. This is a pretty big flaw in the plan, but I think overall still a step forward in developing the coordination tech. 

In addition to the lack of outside evaluation, the equity system misses the concept of funder’s credit from Ben Hoffman’s blog post, which I think is otherwise very valuable. Ultimately we decided that impact shares are no worse than the current startup equity model, and that works pretty well. “No worse than startup equity” was a theme in much of our decision-making around this system.

Advisor Shares

We are still figuring out how many impact shares to give our advisor/first customer. YC has standard advice for this (0.25%-1%), but YC’s advice assumes you will be diluting shares later, so the number is not directly applicable. The advisor mostly doesn’t care right now, because he doesn’t feel that this is taking much effort from him.

It was very important to Jasen to give credit to people who got him to the starting line of this project, even if they were not directly involved in it. Recognizing them by giving them some of his impact shares felt really good to him, way more tangible than thanking mom after spiking a touchdown.

Closing

This is extremely experimental. I expect both the conventions around this to improve over time and for me and Jasen to improve our personal model as we work. Some of that improvement will come from sharing our current ideas and hearing the response, and I didn’t want to wait on starting that conversation, so here we are.

Thanks to several people, especially Austin Chen and Raymond Arnold, for discussion on this topic.

Cognitive Risks of Adolescent Binge Drinking

The takeaway

Our goal was to quantify the cognitive risks of heavy but not abusive alcohol consumption. This is an inherently difficult task: the world is noisy, humans are highly variable, and institutional review boards won’t let us do challenge trials of known poisons. This makes strong inference or quantification of small risks incredibly difficult. We know for a fact that enough alcohol can damage you, and even levels that aren’t inherently dangerous can cause dumb decisions with long-term consequences. All that said… when we tried to quantify the level of cognitive damage caused by college-level binge drinking, we couldn’t demonstrate an effect. This doesn’t mean there isn’t one (if nothing else, “here, hold my beer” moments are real), just that it is below the threshold detectable with current methods and levels of variation in the population.

Motivation

In discussions with recent college graduates I (Elizabeth) casually mentioned that alcohol is obviously damaging to cognition. They were shocked and dismayed to find their friends were poisoning themselves, and wanted the costs quantified so they could reason with them (I hang around a very specific set of college students). Martin Bernstorff and I set out to research this together. Ultimately, 90-95% of the research was done by him, with me mostly contributing strategic guidance and somewhere between editing and co-writing this post. 

[Image: I spent an hour getting DALL-E to draw this]

Problems with research on drinking during adolescence

Literature on the causal medium- to long-term effects of non-alcoholism-level drinking on cognition is, to our strong surprise, extremely lacking. This isn’t just our poor research skills; in 2019, the Danish Ministry of Health attempted a comprehensive review and concluded that:

“We actually know relatively little about which specific biological consequences a high level of alcohol intake during adolescence will have on youth”.

And it isn’t because scientists are ignoring the problem either. Studying medium- and long-term effects on brain development is difficult because of the myriad of confounders and/or colliders for both cognition and alcohol consumption, and because more mechanistic experiments would be very difficult and are institutionally forbidden anyway (“Dear IRB: we would like to violently poison half of these teenagers for four years, while forbidding the other half to engage in standard college socialization”). You could randomize abstinence, but we’ll get back to that.

One problem highly prevalent in alcohol literature is the abstinence bias. People who abstain from alcohol intake are likely to do so for a reason, for example chronic disease, being highly conscientious and religious, or a bad family history with alcohol. Even if you factor out all of the known confounders, it’s still vanishingly unlikely the drinking and non-drinking samples are identical. Whatever the differences, they’re likely to affect cognitive (and other) outcomes. 

Any analysis comparing “no drinking” to “drinking” will suffer from this by estimating the effect of no alcohol + confounders, rather than the effect of alcohol. Unfortunately, this rules out a surprising number of studies (code available upon request). 

Confounding is possible to mitigate if we have accurate intuitions about the causal network and can estimate the effects of confounders accurately. We have to draw a directed acyclic graph (DAG) with the relevant causal factors and adjust analyses or design accordingly. This is essential, but has not permeated all of epidemiology (yet), and especially in the older literature it is often not done. For a primer, Martin recommends “Draw Your Assumptions” on edX here.
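
As a toy illustration of why this matters (entirely my own simulation, not data from any study discussed here): if some third factor both suppresses drinking and independently improves cognition, a naive regression finds a spurious harm from alcohol, while adjusting for that factor recovers the true (null) effect.

```python
# Toy simulation (mine): a confounder that reduces drinking and independently
# raises cognition scores makes alcohol look harmful when it has zero effect.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
confounder = rng.normal(size=n)                    # e.g. conscientiousness
drinking = -0.5 * confounder + rng.normal(size=n)  # confounder suppresses drinking
cognition = 0.3 * confounder + rng.normal(size=n)  # true effect of drinking: zero

# Naive regression of cognition on drinking alone: biased by the confounder.
X = np.column_stack([np.ones(n), drinking])
print(np.linalg.lstsq(X, cognition, rcond=None)[0][1])      # ~ -0.12, spurious harm

# Adjusting for the confounder recovers the true (null) effect.
X_adj = np.column_stack([np.ones(n), drinking, confounder])
print(np.linalg.lstsq(X_adj, cognition, rcond=None)[0][1])  # ~ 0.0
```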

Additionally, alcohol consumption is a politically live topic, and papers are likely to be biased. Which direction is a coin flip: public health wants to make it seem scarier, alcohol companies want to make it seem safer. Unfortunately, these biases don’t cancel out, they just obfuscate everything.

What can we do when we know much of the literature is likely biased, but we do not have a strong idea about the size or direction?

Triangulation

If we aggregate multiple estimates that are wrong, but in different (and overall uncorrelated) directions, we will approximate the true effect. For health, we have a few dimensions that we can vary over: observational/interventional, age, and species.
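
A toy demonstration of that logic, with made-up bias sizes:

```python
# Toy demonstration (made-up numbers): averaging estimates whose biases are
# uncorrelated pulls the aggregate toward the true effect.
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.0
biases = rng.uniform(-1, 1, size=10)                  # one bias per study design
estimates = true_effect + biases + rng.normal(scale=0.1, size=10)

print(np.abs(estimates).mean())  # typical single-study error, ~0.5
print(abs(estimates.mean()))     # error of the pooled mean, typically much smaller
```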

Randomized abstinence studies

Ideally, we would have strong evidence from randomized controlled trials of abstinence. In experimental studies like this, there is no doubt about the direction of causality. And, since participants are randomized, confounders are evenly distributed between intervention and control groups. This means that our estimate of the intervention effect is unbiased by confounders, both measured and unmeasured.

However, we were only able to find two such studies, both from the 80s, among light drinkers (mean 3 standard units per week), and of a duration of only 2-6 weeks (Birnbaum et al., 1983; Hannon et al., 1987).

Birnbaum et al. did not stick to the randomisation when analyzing their data, opening the door to confounding, which should decrease our confidence in their study. They found no effect of abstinence on their 7 cognitive measures.

In Hannon et al., instruction to abstain vs. maintain resulted in a difference in alcohol intake of 12.5 units per week over 2 weeks. On the WAIS-R vocabulary test, abstaining women scored 55.5 ± 6.7 and maintaining women scored 51.0 ± 8.8 (both mean ± SD). On the 3 other cognitive tests performed, they found no difference.

Especially due to the short duration, we should be very wary of extrapolating too much from these studies. However, it appears that for moderate amounts of drinking over a short time period, total abstinence does not provide a meaningful benefit in the above studies.

Observational studies on humans

Due to their observational nature (as opposed to being an experiment), these studies are extremely vulnerable to confounders, colliders, reverse causality etc. However, they are relatively cheap ways of getting information, and are performed in naturalistic settings.

One meta-analysis (Neafsey & Collins, 2011) compared moderate social drinking (< 4 drinks/day) to non-drinkers (note: the definition of moderate varies a lot between studies). They partially compensated for the abstinence bias by excluding “former drinkers” from their reference group, i.e. removing people who’ve stopped drinking for medical (or other) reasons. This should provide a less biased estimate of the true effect. They found a protective effect of social drinking on a composite endpoint, “cognitive decline/dementia” (Odds Ratio 0.79 [0.75; 0.84]).

Interestingly, they also found that studies adjusting for age, education, sex and smoking-status did not have markedly different estimates from those that did not (adjusted OR 0.75 vs. unadjusted OR 0.79). This should decrease our worry about confounding overall.

Observational studies on alcohol for infants

Another angle for triangulation is the effect of moderate maternal alcohol intake during pregnancy on the offspring’s IQ. The brain is never more vulnerable than during fetal development. There are obviously large differences between fetal and adolescent brains, so any generalization should be accompanied with large error bars. However, this might give us an upper bound.

(Zuccolo et al., 2013) performed an elegant example of what’s called Mendelian randomization.

A SNP variant in a gene (ADH1B) is associated with decreased alcohol consumption. Since SNPs are near-randomly assigned (but see the examination of assumptions below), one can interpret it as the SNP causing decreased alcohol consumption. If some assumptions are met, that’s essentially a randomized controlled trial! Alas, these assumptions are extremely strong and unlikely to be totally true – but it can still be much better than merely comparing two groups with differing alcohol consumption.

As the authors very explicitly state, this analysis assumes that:

1. The SNP variant (rs1229984) decreases maternal alcohol consumption. This is confirmed in the data. Unfortunately, the authors do this by chi-square test (“does this alter consumption at all?”) rather than estimating the effect size. However, we can do our own calculations using Table 5:

If we round each alcohol consumption category to the mean of its bounds (0, 0.5, 3.5, 9), we get a mean intake in the SNP variant group of 0.55 units/week and a mean intake in the non-carrier group of 0.88 units/week (math; the calculation is sketched in code after this list). This means that SNP-carrier mothers drink, on average, 0.33 units/week less. That’s a pretty small difference! We would’ve liked the authors to do this calculation themselves, and use it to report IQ-difference per unit of alcohol per week.

2. There is no association between the genotype and confounding factors, including other genes. This assumption is satisfied for all factors examined in the study, like maternal age, parity, education, smoking in 1st trimester etc. (Table 4), but unmeasured confounding is totally a thing! E.g. a SNP which correlates with the current variant and causes a change in the offspring’s IQ/KS2-score.

3. The genotype does not affect the outcome by any path other than maternal alcohol consumption, for example through affecting metabolism of alcohol.
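
Here's the midpoint calculation from assumption 1 in code. The category midpoints are from the paper; the proportions below are hypothetical placeholders for illustration, since the real ones live in the paper's Table 5.

```python
# Category midpoints (units/week) are from the paper; the proportions here are
# HYPOTHETICAL placeholders, not the real Table 5 numbers.
midpoints = [0, 0.5, 3.5, 9]

def mean_intake(proportions):
    # Weighted mean: round each consumption category to its midpoint,
    # then weight by the share of mothers in that category.
    return sum(p * m for p, m in zip(proportions, midpoints))

carriers = [0.80, 0.12, 0.05, 0.03]       # hypothetical
non_carriers = [0.72, 0.15, 0.08, 0.05]   # hypothetical
print(mean_intake(non_carriers) - mean_intake(carriers))  # difference in units/week
```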

If we believe these assumptions to be true, the authors are estimating the effect of 0.33 maternal alcohol units per week on the offspring’s IQ and KS2-score. KS2-score is a test of intellectual achievement (similar to the SAT) for 11-year-olds with a mean of 100 points and a standard deviation of ~15 points. 

They find that the 0.33 unit/week decrease does not affect IQ (mean difference -0.01 [-2.8; 2.7]) and causes a 1.7 point (with a 95% confidence interval of between 0.4 and 3.0) increase in KS2 score. 

This is extremely interesting. Additionally, the authors completed a classical epidemiological study, adjusting for typical confounders.

Their analysis shows that the children of pre-pregnancy heavy drinkers, on average, scored 8.62 points (standard error 1.12) higher on IQ than non-drinkers, or 2.99 points (SE 1.06) after adjusting for confounders. However, they didn’t adjust for alcohol intake in other parts of the pregnancy! Puzzlingly, first trimester drinking has an effect in the opposite direction: -3.14 points (SE 1.64) on IQ. However, this too was not adjusted for previous alcohol intake. This means that the estimates in Table 1 (pre-pregnancy and first trimester) aren’t independent, but we don’t know how they’re correlated. Good luck teasing out the causal effect of maternal alcohol intake and timing from that.

Either way, the authors (and I) interpret the effects as being highly confounded; either residual (the confounder was measured with insufficient accuracy for complete adjustment) or unknown (confounders that weren’t measured). For example, pre-pregnancy alcohol intake was strongly associated with professional social class and education (upper-class wine-drinkers?), whereas the opposite was true for first trimester alcohol intake. Perhaps drinking while you know you’re pregnant is low social status?

If you’re like Elizabeth you’re probably surprised that drinking increases with social class. I didn’t dig into this deeply, but a quick search found that it does appear to hold up.

This result is in conflict with that of the Mendelian randomization, but that makes sense: Mendelian randomization is less sensitive to confounding, so maybe there is no true effect. Also, the Mendelian randomization only estimated the effect of a 0.33 units/week difference, so the analyses are probably not sufficiently powered.

Taken together, the study should probably update us towards a lack of harm from moderate (whatever that means) levels of alcohol intake, although how big an update that is depends on your previous position. We say “moderate” because fetal alcohol syndrome is definitely a thing, so at sufficient alcohol intake it’s obviously harmful!

Rodents

There is a decently sized, pretty well-conducted literature on adolescent intermittent ethanol exposure (science speak for “binge drinking on the weekend”). Rat adolescence is somewhat similar to human adolescence; it’s marked by sexual maturation, increased risk-taking and increased social play (Sengupta, 2013). The following is largely based on a deeper dive into the linked references from (Seemiller & Gould, 2020).

Adolescent intermittent ethanol exposure is typically operationalised as a blood-alcohol concentration equivalent to ~10 standard alcohol units, administered 0.5-3 times/day every 1-2 days during adolescence.

To interpret this, we make some big assumptions. Namely:

  1. Rodent blood-alcohol content can be translated 1:1 to humans
  2. Effects on rodent cognition at a given alcohol concentration are similar to those on human cognition 
  3. Rodent adolescence can mimic human adolescence

Now, let’s dive in!

Two primary tasks are used in the literature:

The 5-choice serial reaction time task. 

Rodents are placed in a small box, and one of 5 holes is lit up. Rodents are scored on how reliably they poke the lit hole.

Training in the 5-CSRTT varies between studies, but the two studies below used 6 training sessions at age 60 days. Initially, rats were rewarded with pellets from the feeder in the box to alert them to the possibility of reward.

Afterwards, training sessions gradually increased in difficulty: the light stayed on for 30 seconds at first, with the duration gradually decreasing to 1 second. Rats progressed to the next training schedule when they met any of 3 predefined criteria: 100 trials completed, >80% accuracy, or <20% omissions (sketched in code below).
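
A sketch of that progression rule (my paraphrase, not the papers' actual code):

```python
# Sketch (my paraphrase, not the studies' code) of the progression rule:
# a rat advances to the next training schedule when any one criterion is met.
def advances(trials_completed: int, accuracy: float, omission_rate: float) -> bool:
    return trials_completed >= 100 or accuracy > 0.80 or omission_rate < 0.20
```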

Naturally, you can measure a ton of stuff here! Generally, focus is on accuracy and omissions, but there are many others:

[Table of measures from (Boutros et al., 2017) sup. table 1, congruent with (Semenova, 2012)]

Now we know how they measured performance; but how did they imitate adolescent drinking?

Boutros et al. administered 5 g/kg of 25% ethanol through the mouth once per day in a 2-day on/off pattern, from age 28 days to 57 days – a total of 14 administrations. Based on blood alcohol content, this is equivalent to 10 standard units at each administration – quite a dose! Surprisingly, they found a decrease in omissions with the standard task, but no other systematic changes, in spite of 50+ analyses on variations of the measures (accuracy, omissions, correct responses, incorrect responses etc.) and task difficulty (length of the light staying on, whether they got the rats drunk etc.). We’d chalk this up to a chance finding.

Semenova (2012) used the same training schedule, but administered 5 g/kg of 25% ethanol through the mouth every 8h for 4 days – a total of 12 administrations. They found small differences in different directions on different measures, but with the same multiple comparisons problem. Looks like noise to us.

The Barnes Maze 

Rodents are placed in the middle of an approximately 1m circle with 20-40 holes at the perimeter and are timed on how quickly they arrive at the hole with a reward (and escape box) below it. To measure spatial learning, the location of the hole is held constant. In (Coleman et al., 2014) and (Vetreno & Crews, 2012), rodents were timed once a day for 5 days. They were then given 4 days of rest, and the escape hole was relocated exactly 180° from the initial location. They were then timed again once a day, measuring relearning.


[Figure: Tracing of the route taken by a control mouse right after the location was reversed, from Coleman et al., 2014.]

Both studies found no effect of adolescent intermittent ethanol exposure on initial learning rate or errors. 

Vetreno found alcohol-exposed rats took longer to escape on their first trial but did equally well in all subsequent trials.

Coleman, meanwhile, found a ~3x difference in performance on the relearning task, with similar half-times.

Somewhat suspiciously, even though Vetreno et al. was performed 2 years later than Coleman et al. and they share the same lab, they do not reference Coleman et al.

This does, technically, show an effect. However, given the small effect size, the number of metrics measured, file drawer effects, and the disagreement with the rest of the literature, we believe this is best treated as a null result.

Conclusion

So, what should we do? From the epidemiological literature, if you care about dementia risk, it looks like social drinking (i.e. excluding alcoholics) reduces your risk by ~20% as compared to not drinking. All other effects were part of a heterogeneous literature with small effect sizes on cognition. Taken together, long-term cognitive effects of conventional alcohol intake during adolescence should play only a minor role in determining alcohol intake.

Thanks to an FTX Future Fund regrantor for funding this work.

Birnbaum, I. M., Taylor, T. H., & Parker, E. S. (1983). Alcohol and Sober Mood State in Female Social Drinkers. Alcoholism: Clinical and Experimental Research, 7(4), 362–368. https://doi.org/10.1111/j.1530-0277.1983.tb05483.x

Boutros, N., Der-Avakian, A., Markou, A., & Semenova, S. (2017). Effects of early life stress and adolescent ethanol exposure on adult cognitive performance in the 5-choice serial reaction time task in Wistar male rats. Psychopharmacology, 234(9), 1549–1556. https://doi.org/10.1007/s00213-017-4555-3

Coleman, L. G., Liu, W., Oguz, I., Styner, M., & Crews, F. T. (2014). Adolescent binge ethanol treatment alters adult brain regional volumes, cortical extracellular matrix protein and behavioral flexibility. Pharmacology Biochemistry and Behavior, 116, 142–151. https://doi.org/10.1016/j.pbb.2013.11.021

Hannon, R., Butler, C. P., Day, C. L., Khan, S. A., Quitoriano, L. A., Butler, A. M., & Meredith, L. A. (1987). Social drinking and cognitive functioning in college students: A replication and reversibility study. Journal of Studies on Alcohol, 48(5), 502–506. https://doi.org/10.15288/jsa.1987.48.502

Neafsey, E. J., & Collins, M. A. (2011). Moderate alcohol consumption and cognitive risk. Neuropsychiatric Disease and Treatment, 7, 465–484. https://doi.org/10.2147/NDT.S23159

Seemiller, L. R., & Gould, T. J. (2020). The effects of adolescent alcohol exposure on learning and related neurobiology in humans and rodents. Neurobiology of Learning and Memory, 172, 107234. https://doi.org/10.1016/j.nlm.2020.107234

Semenova, S. (2012). Attention, impulsivity, and cognitive flexibility in adult male rats exposed to ethanol binge during adolescence as measured in the five-choice serial reaction time task: The effects of task and ethanol challenges. Psychopharmacology, 219(2), 433–442. https://doi.org/10.1007/s00213-011-2458-2

Sengupta, P. (2013). The Laboratory Rat: Relating Its Age With Human’s. International Journal of Preventive Medicine, 4(6), 624–630. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3733029/

Vetreno, R. P., & Crews, F. T. (2012). Adolescent binge drinking increases expression of the danger signal receptor agonist HMGB1 and toll-like receptors in the adult prefrontal cortex. Neuroscience, 226, 475–488. https://doi.org/10.1016/j.neuroscience.2012.08.046

Zuccolo, L., Lewis, S. J., Davey Smith, G., Sayal, K., Draper, E. S., Fraser, R., Barrow, M., Alati, R., Ring, S., Macleod, J., Golding, J., Heron, J., & Gray, R. (2013). Prenatal alcohol exposure and offspring cognition and school performance. A ‘Mendelian randomization’ natural experiment. International Journal of Epidemiology, 42(5), 1358–1370. https://doi.org/10.1093/ije/dyt172

Quick Look: Asymptomatic Herpes Shedding

Tl;dr: Individuals shed and thus probably spread oral HSV1 while completely asymptomatic.

Introduction

“Herpes virus” can refer to several viruses in the herpes family, including chickenpox and Epstein-Barr (which causes mono). All herpesviridae infections are for life: once infected, the virus will curl up in its cell of choice, possibly to leap out and begin reproduction again later. If the virus produces visible symptoms, it is called symptomatic. If the virus is producing viable virions that can infect other people, it’s called shedding. How correlated symptoms and shedding are is the topic of this post. 

When people say “herpes” without further specification, they typically mean herpes simplex 1 or 2. HSV1 and 2 are both permanent infections of nerve cells that can lay dormant forever, or intermittently cause painful blisters on mucous membranes (typically mouth or genitals, occasionally eyes, very occasionally elsewhere). There are also concerns about subtle long-term effects, which I do not go into here.

There are two competing pieces of conventional wisdom on HSV: “you can shed infectious virus at any time, even without a sore. Most people who catch herpes catch it from an asymptomatic individual” and “99.9% of shedding occurs during or right before a blister and there are distinct signs you can recognize if you’re paying attention. If you can recognize an oncoming blister the chances of infecting another human are negligible.” At the request of a client I performed two hours of research to judge between these.

It is definitely true that doctors will only run tests looking for the virus directly (as opposed to antibodies) if you have an active sore. However, when researchers proactively sampled asymptomatic individuals using either genetic material tests (PCR/NAAT, which look for viral DNA in a sample) or viral culture (which attempts to grow virus from your test sample in a petri dish), they reliably found that some people are shedding virus.

HSV1 prefers the mouth but is well known to infect genitals as well. HSV2 is almost exclusively genital. Due to a dearth of studies I’ve included some HSV2 and genital HSV1 studies. 

Studies

Tronstein et al: This paper stupidly lumped in “0% shedding” with “>0% shedding” and I hate them. Ignoring that, they found that 10% of all days recorded from individuals with asymptomatic genital HSV2 involved shedding, and these were distributed on a long tail, with the peak at 0-5%. I cannot tell if they lumped 0% and 0.1% together because 0% never happens, or because they hate science. 

[Image caption: your buckets are bad and you should feel bad]

Bowman et al: 14% of previously symptomatic genital HSV2 patients shed isolate-able virus (sampled every 8 weeks over ~3 years) while on antivirals. This study reports “isolating” virus without further details; I expect this means viral culture. 

Sacks et al: citing another paper, shedding across 6% of days in oral HSV1 patients (using viral culture). It also reported asymptomatic shedding rates for genital herpes.

Spruance: oral HSV1 patients shed isolatable virus 7.4% of the time (including while symptomatic). 60% of this occurred while experiencing mild symptoms that could have indicated an upcoming sore, but never developed into a sore.

Tateishi et al: tested 1000 samples from oral surgery patients (not filtered for HSV infection status). 4.7% had PCR-detectable herpes DNA, and 2.7% had culturable virus. This includes patients without herpes (about 50% of people in Japan, where the research was done), but oral surgery is stressful and often stems from issues that make it easier to shed herpes, so I consider those to ~cancel out.

Conclusion

My conclusion: it is definitely possible to shed HSV while asymptomatic, including if you are never symptomatic. The daily shedding rate is something like 3-12%, although with lots of interpersonal variability. This doesn’t translate directly to an infectiousness rate: human mouths might be harder or easier to infect than petri dishes (my guess is harder, based on the continued existence of serodiscordant couples). It may be possible for people who are antibody positive for HSV to never shed virus but we don’t know because no one ran the right tests. 

Thanks to anonymous client for funding the initial research and my Patreon patrons for supporting the public write-up.

New Water Quality x Obesity Dataset Available

Tl;dr: I created a dataset of US counties’ water contamination and obesity levels. So far I have failed to find anything really interesting with it, but maybe you will. If you are interested you can download the dataset here. Be warned: every spreadsheet program will choke on it; you definitely need to use statistical programming.

[Image credit: DALL-E and a lot of coaxing]

Many of you have read Slime Mold Time Mold’s series on the hypothesis that environmental contaminants are driving weight gain. I haven’t done a deep dive on their work, but their lit review is certainly suggestive. 

SMTM did some original analysis by looking at obesity levels by state, but this is pretty hopeless. They’re using average altitude by state as a proxy for water purity for the entire state, and then correlating that with the state’s % resident obesity. Water contamination does seem negatively correlated with the water source’s altitude, and the source’s altitude is correlated with an end-user’s altitude, and that end-user’s altitude is correlated with their state’s average altitude… but I think that’s too many steps removed with too much noise at each step. So the aggregation by state is basically meaningless, except for showing us Colorado is weird.

So I dug up a better data set, which had contamination levels for almost every water system in the country, accessible by zip code, and another one that had obesity prevalence by county. I combined these into a single spreadsheet and did some very basic statistical analysis on them to look for correlations.
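
For concreteness, here's the shape of that analysis as a minimal sketch; the filename and column names are my guesses, not necessarily what's in the published dataset.

```python
# Minimal sketch of the analysis; "contaminant", "level", and "pct_obese" are
# hypothetical column names, and the filename is a placeholder.
import pandas as pd

df = pd.read_csv("water_x_obesity.csv")

for contaminant, group in df.groupby("contaminant"):
    if len(group) > 10_000:   # only report well-sampled contaminants
        r = group["level"].corr(group["pct_obese"])   # Pearson correlation
        print(f"{contaminant}: r={r:.3f}, n={len(group)}")
```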

Some caveats before we start:

  • The dataset looks reasonable to me, but I haven’t examined it exhaustively and don’t know where the holes are. 
  • Slime Mold Time Mold’s top contender for an environmental contaminant is lithium. While technically present in the database, lithium had five entries so I ignored it. I haven’t investigated, but my guess is no one tests for lithium.
  • It’s rare, but some zip codes have multiple water suppliers, and the spreadsheet treats them as two separate entities that coincidentally have the same obesity prevalence.
  • I’ve made no attempt to back out basic confounding variables like income or age.
  • “% obese” is a much worse metric than average BMI, which is itself a much worse metric than % body fat. 
  • None of those metrics would catch it if a contaminant makes some people very fat while making others thin (SMTM thinks paradoxical effects are a big deal, so this is a major gap for testing their model).
  • Correlation still does not equal causation.

The correlations (for contaminants with >10k entries):

Contaminant                       Correlation   # Samples
Nitrate                           -0.039        21430
Total haloacetic acids (HAAs)      0.055        14666
Chloroform                         0.046        15065
Barium (total)                     0.040        17929
Total trihalomethanes (TTHMs)      0.117        21184
Copper                            -0.002        17113
Dibromochloromethane               0.080        13856
Nitrate & nitrite                  0.035        11902
Bromodichloromethane               0.079        14238
Lead (total)                      -0.006        13031
Dichloroacetic acid               -0.003        10159

Of these, the only one that looks interesting is total trihalomethanes (TTHMs), a chemical group that includes chloroform.

Visually, the graph of TTHMs against obesity looks like the floor is rising much faster than the ceiling, but in a conversation on Twitter SMTM suggested that’s an artifact of the bivariate distribution; it disappears if you look at the log normal.

Very casual googling suggests that TTHMs are definitely bad for pregnancy in sufficient quantities, and are maybe in a complicated relationship with Type 2 diabetes, but no slam dunks.

This is about as far as I’ve had time to get. My conclusions alas are not very actionable, but maybe someone else can do something interesting with the data.

Thanks to Austin Chen for zipping the two data sets together, Daniel Filan for doing additional data processing and statistical analysis, and my Patreon patrons for supporting this research.

Home Antigen Tests Aren’t Useful For Covid Screening

Epistemic status: I strongly believe this is the right conclusion given the available data. The best available data is not that good, and if better data comes out I reserve the right to change my opinion.

EDIT (4/27): In a development I consider deeply frustrating but probably ultimately good, the same office is now getting much more useful information from antigen tests. They aren’t tracking with the same rigor so I can’t compare results, but they are now beating the bar of “literally ever noticing covid”.

In an attempt to avoid covid without being miserable, many of my friends are hosting group events but requiring attendees to take a home covid test beforehand. Based on data from a medium-sized office, I believe testing with the home tests people are using is security theater that provides no decrease in risk: antigen tests don’t work for covid screening. There is a more expensive home test available that provides some value, and rapid PCR may still be viable.

It’s important to distinguish between test types here: antigen tests look for viral proteins, and genetic amplification tests amplify viral RNA until it reaches detectable levels. The latter are much more sensitive. Most home tests are antigen tests, with the exception of Cue, which uses NAAT (a type of genetic amplification). An office in the bay area used aggressive testing with both Cue and antigen tests to control covid in the office and kept meticulous notes, which they were kind enough to share with me. Here are the aggregated numbers: 

  • The office requested daily Cue tests from workers. I don’t know how many people this ultimately included, probably low hundreds? I expect compliance was >95% but not perfect.
    • The results are from January when the dominant strain was Omicron classic, but no one got strain tested.
  • 39 people had at least one positive Cue test, all of whom were either asymptomatic or ambiguously symptomatic (e.g. symptoms could be explained by allergies) at the time, and 27 of whom had recent negative Cue tests (often but not always the day before, sometimes the same day)
  • Of these, 10 definitely went on to develop symptoms, 7 definitely did not, and 18 were ambiguous (and a few were missing data).
  • 33 people with positives were retested with Cue tests, of whom 9 were positive.
  • Of those 24 who tested positive and then negative, 4 tested positive on tests 3 or 4.
  • Of the 20 people with a single positive test followed by multiple negative retests, 6 went on to develop symptoms.
  • 0 people tested positive on antigen tests. There was not a single positive antigen test across this group. The antigen tests not only didn’t catch covid as early as Cue did, they did not catch any cases at all, including at least 2 people who took the tests while experiencing definitive symptoms.
    • Antigen tests were a mix of Binax and QuickVue.
    • Early cases took multiple antigen tests over several days, later cases stopped bothering entirely.
    • The “negative test while symptomatic” count is artificially low because I excluded people with ambiguous symptoms, and because later infectees didn’t bother with antigen tests. 
    • I suppose I can’t rule out the possibility that they had an unrelated disease with similar symptoms and a false positive on the Cue test. But it seems unlikely that that happened 10-28 times out of a few hundred people without leaving other evidence.

A common defense of antigen tests is that they detect whether you’re contagious at that moment, not whether you will eventually become contagious. Given the existence of people who tested antigen-negative while Cue-positive and symptomatic, I can’t take that seriously.

Unfortunately Cue tests are very expensive. You need a dedicated reader, which is $250, and tests are $65 each (some discount if you sign up for a subscription). A reader can only run 1 test at a time and each test takes 30 minutes, so you need a lot for large gatherings even if people stagger their entrances. 
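
To make the logistics concrete, here's a back-of-envelope calculator using the prices above (my arithmetic, assuming perfectly staggered arrivals):

```python
# Back-of-envelope cost of Cue-testing an event, using the prices above.
# Each reader runs one 30-minute test at a time: 2 tests per reader per hour.
import math

def cue_event_cost(guests: int, testing_window_hours: float,
                   reader_price: float = 250, test_price: float = 65):
    readers_needed = math.ceil(guests / (2 * testing_window_hours))
    total = readers_needed * reader_price + guests * test_price
    return readers_needed, total

print(cue_event_cost(40, 2))  # (10, 5100): 10 readers and $5,100 for 40 guests
```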

My contact’s best guess is that the aggressive testing reduced but did not eliminate in-office spread, but it’s hard to quantify because any given case could have been caught outside the office, and because they were trying so many interventions at once. Multiple people tested positive, took a second test right away, and got a negative result, some of whom went on to develop symptoms; we should probably assume the same chance of someone testing negative when a second test would have come back positive, and some of those would have been true positives. So even extremely aggressive testing has gaps.

Meanwhile, have I mentioned lately how good open windows and air purifiers are for covid? And other illnesses, and pollution? And that taping a HEPA filter to a box fan is a reasonable substitute for an air purifier achievable for a very small number of dollars? Have you changed your filter recently? 

PS. Before you throw your antigen tests out, note that they are more useful than Cue tests for determining if you’re over covid. Like PCR, NAAT can continue to pick up dead RNA for days, maybe weeks, after you have cleared the infection. A negative antigen test after symptoms have abated and there has been at least one positive test is still useful evidence to me. 

PPS. I went through some notes and back in September I estimated that antigen testing would catch 25-70% of presymptomatic covid cases. Omicron moves faster, maybe faster enough that 25% was reasonable for delta, but 70% looks obviously too high now. 

PPPS. Talked to another person at the office, their take is the Cue tests are oversensitive. I think this fits the data worse but feel obliged to pass it on since they were there and I wasn’t.

PPPPS (5/02): multiple people responded across platforms that they had gotten positive antigen tests. One or two of these was even presymptomatic. I acknowledge the existence proof but will not be updating until the data has a denominator. If you’re doing a large event like a conference I encourage you to give everyone Cue, antigen, and rapid PCR tests and record their results, and who eventually gets sick. If you’d like help designing this experiment in more detail please reach out (elizabeth-at-acesounderglass.com)

I Caught Covid And All I Got Was This Lousy Ambiguous Data

Tl;dr I tried to run an n of 1 study on niacin and covid, and it failed to confirm or disprove anything at all.

You may remember that back in October I published a very long post investigating a niacin-based treatment protocol for long covid. My overall conclusion was “seems promising but not a slam dunk; I expect more rigorous investigation to show nothing but we should definitely check”. 

Well recently I got covid and had run out of more productive things I was capable of doing, so decided to test the niacin theory. I learned nothing, but it was a lot of effort and I deserve a blog post out of it; null results are still results, so I’m sharing anyway.

Background On Niacin

Niacin is a B-vitamin used in a ton of metabolic processes. If you’re really curious, I describe it in excruciating detail in the original post.

All B vitamins are water-soluble, and it is generally believed that unless you take unbelievably stupid doses you will pee out any excess intake without noticing. It’s much harder to build up stores of water-soluble vitamins than fat-soluble ones, so you need a more regular supply. Niacin is a little weird among the water-solubles in that it gives very obvious signs of overdose, called flush: itchy skin and feeling overheated. Large doses can lead to uncontrolled shaking, but why would you ever take that much, when it’s so easy to avoid?

People regularly report response patterns that sure look like their body has a store of niacin that can be depleted and refilled over time. A dose someone has been taking for weeks or months will suddenly start giving them flush, and if they don’t lower it the flush symptoms will get worse and worse. 

Some forms of niacin don’t produce flush. Open question if those offer the same benefits with no side effects, offer fewer benefits, or are completely useless.

Niacin And Long Covid

There’s an elaborate hypothesis about how covid depletes niacin (and downstream products), and this is a contributor to long covid. My full analysis is here. As of last year I hadn’t had covid (this is antibody-test confirmed; I definitely didn’t have an asymptomatic case) but I did have lingering symptoms from my vaccine and not a lot else to try, so I gave the protocol a shot.

My experience was pretty consistent with the niacin-storage theory. I spent a long time at quite a high dose of the form of niacin the protocol recommends, nicotinic acid. My peak dose without flush was at least 250mg (1563% RDA) and maybe even 375mg (2345% RDA). When I hit my limit I lowered my dose until I started getting flush at the new dose, and eventually went off nicotinic acid entirely (although I restarted a B-vitamin that included 313% RDA of a different form). That ended in September or early October 2021. It made no difference in my lingering vaccine symptoms.

In early 2022 I tried nicotinic acid again. Even ¼ tablet (62.5mg, 390% RDA) gave me flush.

I Get Covid

Once I developed symptoms and had done all the more obviously useful things like getting Paxlovid, I decided it would be fun to test myself with niacin (and the rest of the supplement stack discussed in my post) and see if covid had any effect. So during my two weeks of illness and week of recovery I occasionally took nicotinic acid and recorded my results. Here’s the overall timeline:

  1. Day -2: am exposed to covid.
  2. Day 0: test positive on a Cue test (a home test that uses genetic amplification).
    1. Lung capacity test: 470 (over 400 is considered healthy).
    2. Start fluvoxamine and the vitamin cocktail, although I’m inconsistent with both the new and existing vitamins during the worst of the illness. Vitamin cocktail includes 313% RDA of no-flush niacin, but not nicotinic acid. 
  3. Day 1: symptomatic AF. 102.3 degree fever, awake only long enough to pee, refill my water, and make sure my O2 saturation isn’t going to kill me. I eat nothing the entire day.
    1. I monitored my O2 throughout this adventure but it never went into a dangerous zone so I’m leaving it out of the rest of the story.
  4. Day 2: start with 99 degree fever, end day with no fever. Start Paxlovid.
    1. Every day after this I am awake a little bit longer, eat a little bit more, and have a little more cognitive energy, although it takes a while to get back to normal. 
    2. Try ¼ tab nicotinic acid (62.5 mg / 390% RDA), no flush.
    3. Lung capacity troughs at 350 (considered orange zone).
  5. Day 4: ½ tablet nicotinic acid, mild flush.
  6. Day 7: lung capacity up to 450, it will continue to vary from 430-450 for the next two weeks before occasionally going higher.
  7. Day 9: ½ tablet nicotinic acid, mild flush
  8. Day 10-17: ⅓ tablet nicotinic acid, no flush
    1. Where by “⅓” tablet I mean “I bit off an amount of pill that was definitely >¼ and <½ and probably averaged to ~⅓ over time”
  9. Day 12: I test positive on a home antigen test
  10. Day 15: I test negative on a home antigen test (no tests in between) 
  11. Day 17: ⅓ tablet produces flush (and a second negative antigen test)
    1. This was also the first day I left my house. I had thought of myself as still prone to fatigue but ended up having a lot of energy once I got out of my house and have been pretty okay since.

Conclusions

My case of covid was about as bad as you get while still technically counting as mild. Assuming I went into it with niacin stores such that 62.5mg nicotinic acid would generate flush, it looks like covid immediately took a small bite out of them. Or it reduced my absorption of vitamins, such that the same oral dosage resulted in less niacin being taken in. There’s no way to know whether covid had a larger effect on niacin than other illnesses would, because I don’t have any to compare it to. Or maybe the whole thing was an artifact of “not eating for two days, and then only barely, and being inconsistent with my vitamins for a week”.

Bazant: An alternate covid calculator

Most of what I see people use Microcovid.org for now is estimating risk for large gatherings, which it was not designed for and thus doesn’t handle very well. I spent a few hours going through every covid calculator I could find, and this calculator from the Bazant lab at MIT, while less user-friendly than Microcovid and having some flaws of its own, is tailor-made for calculating risks for groups indoors, and I think it is worth a shot.

[Note: I’ll be discussing the advanced version of the calculator here; I found the basic version too limited]

The Bazant calculator comes out of a physics lab with a very detailed model of how covid particles hang and decay in the air, and how this is affected by ventilation and filtration. I haven’t checked their model, but I never checked Microcovid’s model either. The Bazant calculator lets you very finely adjust the parameters of a room: dimensions, mechanical ventilation, air filtration, etc. It combines those with more familiar parameters like vaccination and mask usage and feeds them into the model in this paper to produce an estimate of how long N people can be in a room before they accumulate a per-person level of risk between 0 and 1 (1 = person is definitely getting covid = 1,000,000 microcovids per person; .1 = 10% chance someone gets sick = 100,000 microcovids per person). It also produces an estimate of how much CO2 should accumulate over that time, letting you use a CO2 monitor to check its work and notice if risk is accumulating more rapidly than expected.
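
If you think in microcovids, the conversion is simple; here's a one-liner restating the definition above:

```python
# Converting a microcovid budget into the calculator's 0-1 risk tolerance,
# per the definitions above (1.0 == 1,000,000 microcovids per person).
def risk_tolerance(microcovids_per_person: float) -> float:
    return microcovids_per_person / 1_000_000

print(risk_tolerance(100_000))  # 0.1: a 10% chance per person
```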

Reasons/scenarios to use the Bazant calculator over Microcovid:

  • You have a large group and want to set % immunized or effective mask usage for the group as a whole, instead of configuring everyone’s vaccinations and masks individually.
  • You want to incorporate the mechanics of the room and ventilation in really excruciating detail. 
  • You want to set your own estimate for prevalence based on beliefs about your subpopulation.
  • You want a live check on your work, in the form of the CO2 estimates.

Reasons to use Microcovid instead:

  • Your scenario is outside – Bazant calculator doesn’t handle this at all.
  • You don’t want to have an opinion on infection prevalence, immunization, or mask usage.
  • Your masks are better than surgical masks (Bazant doesn’t handle N95 or similar. Also, it rates surgical masks as 90% effective, which seems very high to me).
  • Your per-person risk tolerance is < 10,000 microcovids (Bazant calculator can’t be set at a lower risk tolerance, although you can do math on their results to approximate this).
  • You’re still using a bubble model, or tracking accumulated risk rather than planning for an event.

Scenarios neither handle well

  • Correlated risk. You might be fine with 10% of your attendees getting sick, but not a 10% chance of all of the attendees getting sick at once.
  • Differences in risk from low-dose vs. high-dose exposures.

I’m not currently planning any big events, but if someone else is, please give this a try and let us know if it is useful. 

Epistemic Legibility

Tl;dr: being easy to argue with is a virtue, separate from being correct.

Introduction

Regular readers of my blog know of my epistemic spot check series, where I take claims (evidential or logical) from a work of nonfiction and check to see if they’re well supported. It’s not a total check of correctness: the goal is to rule out things that are obviously wrong/badly formed before investing much time in a work, and to build up my familiarity with its subject. 

Before I did epistemic spot checks, I defined an easy-to-read book as, roughly, imparting an understanding of its claims with as little work from me as possible. After epistemic spot checks, I started defining easy to read as “easy to epistemic spot check”. It should be as easy as possible (but no easier) to identify what claims are load-bearing to a work’s conclusions, and figure out how to check them. This is separate from correctness: things can be extremely legibly wrong. The difference is that when something is legibly wrong someone can tell you why, often quite simply. Illegible things just sit there at an unknown level of correctness, giving the audience no way to engage.

There will be more detailed examples later, but real quick: “The English GDP in 1700 was $890324890. I base this on $TECHNIQUE interpretation of tax records, as recorded in $REFERENCE” is very legible (although probably wrong, since I generated the number by banging on my keyboard). “Historically, England was rich” is not. “Historically, England was richer than France” is somewhere in-between. 

“It was easy to apply this blog post format I made up to this book” is not a good name, so I’ve taken to calling the collection of traits that make things easy to check “epistemic legibility”, in the James C. Scott sense of the word legible. Legible works are (comparatively) easy to understand, they require less external context, their explanations scale instead of needing to be tailored for each person. They’re easier to productively disagree with, easier to partially agree with instead of forcing a yes or no, and overall easier to integrate into your own models.

[Like everything in life, epistemic legibility is a spectrum, but I’ll talk about it mostly as a binary for readability’s sake]

When people talk about “legible” in the Scott sense they often mean it as a criticism, because pushing processes to be more legible cuts out illegible sources of value. One of the reasons I chose the term here is that I want to be very clear about the costs of legibility and the harms of demanding it in excess. But I also think epistemic legibility leads people to learn more correct things faster and is typically underprovided in discussion.

If I hear an epistemically legible argument, I have a lot of options. I can point out places I think the author missed data that impacts their conclusion, or made an illogical leap. I can notice when I know of evidence supporting their conclusions that they didn’t mention. I can see implications of their conclusions that they didn’t spell out. I can synthesize it with other things I know that the author didn’t include.

If I hear an illegible argument, I have very few options. Perhaps the best case scenario is that it unlocks something I already knew subconsciously but was unable to articulate, or needed permission to admit. This is a huge service! But if I disagree with the argument, or even just find it suspicious, my options are kind of crap. I can write a response of equally low legibility, which is unlikely to improve understanding for anyone. Or I can write up a legible case for why I disagree, but that is much more work than responding to a legible original, and often more work than went into the argument I’m responding to, because it’s not obvious what I’m arguing against. I need to argue against many more things to be considered comprehensive. If you believe Y because of X, I can debate X. If you believe Y because …:shrug:… I have to imagine every possible reason you could do so, counter all of them, and then still leave myself open to something I didn’t think of. Which is exhausting.

I could also ask questions, but the more legible an argument is, the easier it is to know which questions matter and what the most productive way to ask them is.

I could walk away, and I am in fact much more likely to do that with an illegible argument. But that ends up taxing legibility: the legible argument attracts criticism precisely because it is easier to argue with, while the illegible one escapes engagement entirely, which is the opposite of the incentive I want to create.

Not everything should be infinitely legible. But I do think more legibility would be good on most margins, that choices of the level of legibility should be made more deliberately, and that we should treat highly legible and illegible works more differently than we currently do. I’d also like a common understanding of legibility so that we can talk about its pluses and minuses, in general or for a particular piece.

This is pretty abstract and the details matter a lot, so I’d like to give some better examples of what I’m gesturing at. In order to reinforce the point that legibility and correctness are orthogonal, this will be a four-quadrant model.

True and Legible

Picking examples for this category was hard. No work is perfectly true and perfectly legible, in the sense of being absolutely impossible to draw an inaccurate conclusion from and having no possible improvements to legibility, because reality is very complicated and communication has space constraints. Every example I considered, I could see a reason someone might object to it. And the things that are great at legibility are often boring. But it needs an example so…

Acoup

Bret Devereaux over at Acoup consistently writes very interesting history essays that I found both easy to check and mostly true (although with some room for interpretation, and not everyone agrees). Additionally, a friend of mine who is into textiles tells me his textile posts were extremely accurate. So Devereaux does quite well on truth and legibility, despite bringing a fair amount of emotion and strong opinions to his work. 

As an example, here is a paragraph from a post arguing against descriptions of Sparta as a highly equal society.

But the final word on if we should consider the helots fully non-free is in their sanctity of person: they had none, at all, whatsoever. Every year, in autumn by ritual, the five Spartan magistrates known as the ephors (next week) declared war between Sparta and the helots – Sparta essentially declares war on part of itself – so that any spartiate might kill any helot without legal or religious repercussions (Plut. Lyc. 28.4; note also Hdt. 4.146.2). Isocrates – admittedly a decidedly anti-Spartan voice – notes that it was a religious, if not legal, infraction to kill slaves everywhere in Greece except Sparta (Isoc. 12.181). As a matter of Athenian law, killing a slave was still murder (the same is true in Roman law). One assumes these rules were often ignored by slave-holders of course – we know that many such laws in the American South were routinely flouted. Slavery is, after all, a brutal and inhuman institution by its very nature. The absence of any taboo – legal or religious – against the killing of helots marks the institution as uncommonly brutal not merely by Greek standards, but by world-historical standards.

Here we have some facts on the ground (Spartiates could kill their slaves, killing slaves was murder in most contemporaneous societies), sources for some but not all of them (those parentheticals are highly readable if you’re a classicist, and workable if you’re not), the inference he drew from them (Spartans treated their slaves unusually badly), and the conclusions he drew from that (Sparta was not only inequitable, it was unusually inequitable even for its time and place).

Notably, the entire post relies heavily on the belief that slavery is bad, which Devereaux does not bother to justify. That’s a good choice because it would be a complete waste of time for modern audiences – but it also makes this post completely unsuitable for arguing with anyone who disagrees. If for some reason you needed to debate the ethics of slavery, you would need work that makes a legible case for that claim in particular, not work that takes it as an axiom.

Exercise for Mood and Anxiety

A few years ago I ESCed Exercise for Mood and Anxiety, a book that aims to educate people on how exercise can help their mental health and then give them the tools to do so. It did really well at the former: the logic was compelling and the foundational evidence was well cited and mostly true (although exercise science always has wide error bars). But out of 14 people who agreed to read the book and attempt to exercise more, only three reported back to me and none of them reported an increase in exercise. So EfMaA is true and epistemically legible, but nonetheless not very useful. 

True but Epistemically Illegible

You Have About Five Words is a poetic essay from Ray Arnold. The final ~paragraph is as follows:

If you want to coordinate thousands of people…

You have about five words.

This has ramifications on how complicated a coordinated effort you can attempt.

What if you need all that nuance and to coordinate thousands of people? What would it look like if the world was filled with complicated problems that required lots of people to solve?

I guess it’d look like this one.

I think the steelman of its core claim – that humans are bad at remembering long nuanced writing, and that the more people you are communicating with, the more you need to simplify your writing – is obviously true. This is good, because Ray isn’t doing crap to convince me of it. He cites no evidence and gives no explanation of his logic. If I thought nuance increased with the number of readers, I would have nothing to say other than “no you’re wrong”, or would have to write my own post from scratch, because he gives no hooks to refute. If someone tried to argue that you get ten words rather than five, I would think they were missing the point. If I thought he had the direction right but got the magnitude of the effect wrong enough that it mattered (and he was a stranger rather than a friend), I would not know where to start the discussion.

[Ray gets a few cooperation points back by explicitly labeling this as poetry, which normally I would be extremely happy about, but it weakened its usefulness as an example for this post so right this second I’m annoyed about it.]

False but Epistemically Legible

Mindset

I think Carol Dweck’s Mindset and associated work is very wrong, and I can produce large volumes on specific points of disagreement. This is a sign of a work that is very epistemically legible: I know what her cruxes are, so I can say where I disagree. For all the shit I’ve talked about Carol Dweck over the years, I appreciate that she made it so extraordinarily easy to do so, because she was so clear on where her beliefs came from. 

For example, here’s a quote from Mindset:

All children were told that they had performed well on this problem set: “Wow, you did very well on these problems. You got [number of problems] right. That’s a really high score!” No matter what their actual score, all children were told that they had solved at least 80% of the problems that they answered.

Some children were praised for their ability after the initial positive feedback: “You must be smart at these problems.” Some children were praised for their effort after the initial positive feedback: “You must have worked hard at these problems.” The remaining children were in the control condition and received no additional feedback.

And here’s Scott Alexander’s criticism:

This is a nothing intervention, the tiniest ghost of an intervention. The experiment had previously involved all sorts of complicated directions and tasks, I get the impression they were in the lab for at least a half hour, and the experimental intervention is changing three short words in the middle of a sentence.

And what happened? The children in the intelligence praise condition were much more likely to say at the end of the experiment that they thought intelligence was more important than effort (p < 0.001) than the children in the effort condition. When given the choice, 67% of the effort-condition children chose to set challenging learning-oriented goals, compared to only 8% (!) of the intelligence-condition. After a further trial in which the children were rigged to fail, children in the effort condition were much more likely to attribute their failure to not trying hard enough, and those in the intelligence condition to not being smart enough (p < 0.001). Children in the intelligence condition were much less likely to persevere on a difficult task than children in the effort condition (3.2 vs. 4.5 minutes, p < 0.001), enjoyed the activity less (p < 0.001) and did worse on future non-impossible problem sets (p…you get the picture). This was repeated in a bunch of subsequent studies by the same team among white students, black students, Hispanic students…you probably still get the picture.

Scott could make those criticisms because Dweck described her experiment in detail. If she’d said “we encouraged some kids and discouraged others”, there would be a lot more ambiguity.

Meanwhile, I want to criticize her for lying to children. Messing up children’s feedback system creates the dependencies on adult authorities that lead to problems later in life. This is extremely bad even if it produces short-term improvements (which it doesn’t). But I can only do this with confidence because she specified the intervention.

The Fate of Rome

This one is more overconfident than false. The Fate of Rome laid out very clearly how its author used new tools for recovering meteorological data to determine the weather 2,000 years ago, and how he used that data to analyze the Roman empire. Using this new data, the book concludes that the peak of Rome was at least partially caused by a prolonged period of unusually good farming weather in the Mediterranean, and that the collapse started or was worsened when the weather began to regress to the mean.

I looked into the archeometeorology techniques and determined that, in my judgement, they had wider confidence intervals than the book indicated, which undercut the causality claims. I wish the book had been more cautious with its evidence, but I really appreciate that the reasoning was laid out so clearly, which made it really easy to look up the points I might disagree with.

False and Epistemically Illegible

Public Health and Airborne Pathogen Transmission

I don’t know exactly what the CDC’s or WHO’s current stance is on breathing-based transmission of covid, and I don’t care, because they were so wrong for so long in such illegible ways. 

When covid started, the CDC and WHO’s story was that it couldn’t be “airborne”, because the viral particle was > 5 microns. That phrasing was already anti-legible for material aimed at the general public, because airborne has a noticeably different definition in virology (“can persist in the air indefinitely”) than it does in popular use (“I can catch this through breathing”). But worse than that, they never provided any justification for the claim. This was reasonable for posters, but not everything was so space constrained, and when I looked in February 2021 I could not figure out where the belief that airborne transmission was rare was coming from. Some researchers eventually spent dozens to hundreds of hours on this and determined that the 5 micron number probably came from studies of tuberculosis, which for various reasons needs to get deeper into the lungs than most pathogens and thus has stronger size constraints. If the CDC had pointed to their sources from the start we could have determined the 5 micron limit was bullshit much more easily (the fact that many relevant people accepted it without that proof is a separate issue).

When I wrote up the Carol Dweck example, it was easy. I’m really confident in what Carol Dweck believed at the time of writing Mindset, so it’s really easy to describe why I disagree. Writing this section on the CDC was harder, because I cannot remember exactly what the CDC said and when they said it; a lot of the message lived in implications; their statements from early 2020 are now memory-holed, and while I’m sure I could find them on archive.org, it’s not really going to quiet the nagging fear that someone in the comments will pull up a different thing they said somewhere else that doesn’t say exactly what I claimed they said, or that I view as of a piece with what I cited but both statements are fuzzy enough that it would be a lot of work to explain why I think the differences are immaterial…

That fear and difficulty in describing someone’s beliefs is the hallmark of epistemic illegibility. The wider the confidence interval on what someone is claiming, the more work I have to do to question it.

And More…

The above was an unusually legible case of illegibility. Most illegible and false arguments don’t feel like that. They just feel frustrating and bad, like the other person is wrong but it’s too much work to demonstrate how. This is inconveniently similar to the feeling you get when the other person is right but you don’t want to admit it. I’m going to gesture some more at illegibility here, but it’s an inherently illegible concept, so there will be genuinely legible (to someone) works that resemble these points, and illegible works that don’t.

Marks of probable illegibility:

  • The person counters every objection raised, but the counters aren’t logically consistent with each other. 
  • You can’t nail down exactly what the person actually believes. This doesn’t mean they’re uncertain – saying “I think this effect is somewhere between 0.1x and 10000x” is very legible, and sometimes the best you can do given the data. It’s more that they imply a narrow confidence band, but the value that band surrounds moves depending on the subargument. Or they agree they’re being vague but they move forward in the argument as if they were specific. 
  • You feel like you understand the argument and excitedly tell your friends. When they ask obvious questions you have no answer or explanation. 

A good example of illegibly bad arguments that specifically try to ape legibility is a certain subset of alt-medicine advertisements. They start out very specific, with things like “there are 9804538905 neurons in your brain carrying 38923098 neurotransmitters”, with rigorous citations demonstrating those numbers. Then they introduce their treatment in a way that very strongly implies it works with those 38923098 neurotransmitters – but not, like, what it does to them or why we would expect that to have a particular effect. Then they wrap it up with some vague claims about wellness, so you’re left with the feeling that you’ll definitely feel better if you take their pill, but if you complain about any particular problem it didn’t fix, they have plausible deniability.

[Unfortunately the FDA’s rules around labeling encourage this illegibility even for products that have good arguments and evidence for efficacy on specific problems, so the fact that a product does this isn’t conclusive evidence it’s useless.]

Bonus Example: Against The Grain

The concept of epistemic legibility was in large part inspired by my first attempt at James C. Scott’s Against the Grain (if that name seems familiar: Scott also coined “legibility” in the sense in which I am using it), whose thesis is that key properties of grains (as opposed to other domesticates) enabled early states. For complicated reasons I read more of AtG without epistemic checking than I usually would, and then checks were delayed indefinitely, and then covid hit, and then my freelancing business really took off… the point is, when I read Against the Grain in late 2019, it felt like it was going to be the easiest epistemic spot check I’d ever done. Scott was so cooperative in labeling his sources, claims, and logical conclusions. But when I finally sat down to check his work, I found serious illegibilities.

I did the spot check over Christmas this year (which required restarting the book). It was maybe 95% as good as I remembered, which is extremely high. At chapter 4 (which is halfway through the book, due to the preface and introduction), I felt kinda overloaded and started to spot check some claims (mostly factual – the logical ones all seemed to check out as I read them). A little resentfully, I checked this graph.

This should have been completely unnecessary; Scott is a decent writer and scientist who was not going to screw up basic dates. I even split the claims section of the draft into two sections, “Boring” and “Interesting”, because I obviously wasn’t going to come up with anything checking names and dates and I wanted that part to be easy to skip.

I worked from the bottom. At first, it was a little more useful than I expected – a major new interpretation of the data came out the same year the book was published, so Scott’s timing on anatomically modern humans was out of date, but not in a way that reflected poorly on him.

Finally I worked my way up to “first walled, territorial state”. Not thinking super hard, I googled “first walled city”, and got a date 3000 years before the one Scott cites. Not a big deal: he specified state, not walls. What can I google to find that out? “Earliest state”, obviously, and the first google hit does match Scott’s timing, but… what made something a state, and how can we assess those traits from archeological records? I checked, and nowhere in the preface, introduction, or first three chapters was “state” defined. No work can define every term it uses, but this is a pretty important one for a book whose full title is Against the Grain: A Deep History of the Earliest States.

You might wonder if “state” had a widespread definition such that it didn’t need to be defined. I think this is not the case, for a few reasons. First, Against The Grain is aimed at a mainstream audience, and that requires defining terms even if they’re commonly known by experts. Second, even if a reader knew the common definition of what made a state, how you determine whether something was a state or merely a city from archeological records is crucial for understanding the inner gears of the book’s thesis. Third, when Scott finally gives a definition, it’s not the same as the one on Wikipedia.

[longer explanation] Among these characteristics, I propose to privilege those that point to territoriality and a specialized state apparatus: walls, tax collection, and officials.

Against the Grain

States are minimally defined by anthropologist David S. Sandeford as socially stratified and bureaucratically governed societies with at least four levels of settlement hierarchy (e.g., a large capital, cities, villages, and hamlets)

Wikipedia (as of 2021-12-26)

These aren’t incompatible, but they’re very far from isomorphic. I expect that even though there’s a fairly well accepted definition of state in the relevant field(s), there are disputed edges that matter very much for this exact discussion, in which Scott views himself as pushing back against the commonly accepted narrative. 

To be fair, the definition of state was not that relevant to chapters 1-3, which focus on pre-state farming. Unless, you know, your definition of “state” differs sufficiently from his. 

Against The Grain was indeed very legible in other ways, but it loses basically all of its accrued legibility points and more for not giving even a cursory definition of a crucial term in the introduction, and for doing an insufficient job halfway through the book.

This doesn’t mean the book is useless, but it does mean it was going to be more work to extract value from than I felt like putting in on this particular topic.

Why is this Important?

First of all, it’s costing me time.

I work really hard to believe true things and disbelieve false things, and people who argue illegibly make that harder, especially when people I respect treat arguments as more proven than their level of legibility allows them to be. I expect having a handle with which to say “no I don’t have a concise argument about why this work is wrong, and that’s a fact about the work” to be very useful.

More generally, I think there’s a range of acceptable legibility levels for a given goal, but we should react differently based on which legibility level the author chose, and that arguments will be more productive if everyone involved agrees on both the legibility level and on the proper response to a given legibility level. One rule I have is that it’s fine to declare something a butterfly idea and thus off limits to sharp criticism, but that inherently limits the calls to action you can make based on that idea. 

Eventually I hope people will develop some general consensus around the rights and responsibilities of a given level of legibility, and that this will make arguments easier and more productive. Establishing those rules is well beyond the scope of this post. 

Legibility vs Inferential Distance

You can’t explain everything to everyone all of the time. Some people are not going to have the background knowledge to understand a particular essay of yours. In cases like this, legibility is defined as “the reader walks away with the understanding that they didn’t understand your argument”. Illegibility in this case is when they erroneously think they understand your argument. In programming terms, it’s the difference between a failed function call returning a useful error message (legible), versus failing silently (illegible).  
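Here’s a minimal sketch of that programming analogy in Python; the functions and error message are invented for illustration, not taken from any real library:

```python
# Legible failure: the caller learns exactly what they didn't understand.
# Illegible failure: the caller may walk away believing the call succeeded.

def parse_date_legible(s: str):
    parts = s.split("-")
    if len(parts) != 3:
        # Fails loudly, with enough context to know what went wrong.
        raise ValueError(f"expected YYYY-MM-DD, got {s!r}")
    year, month, day = (int(p) for p in parts)
    return (year, month, day)

def parse_date_illegible(s: str):
    parts = s.split("-")
    if len(parts) != 3:
        return None  # Fails silently; the caller may never notice.
    year, month, day = (int(p) for p in parts)
    return (year, month, day)

print(parse_date_legible("2021-12-26"))    # (2021, 12, 26)
print(parse_date_illegible("Boxing Day"))  # None, with no hint about why
```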

A particularly dangerous way this can occur is when you’re using terms of art (meaning: words or phrases that have very specific meanings within a field) that are also common English words. You don’t want someone thinking you’re dismissing a medical miracle because you called it statistically insignificant, or invalidating the concept of thought work because it doesn’t apply force to move an object.

Cruelly, misunderstanding becomes more likely the more similar the technical definition is to the English definition. I watched a friend use the term “common knowledge” to mean “everyone knows that everyone knows, and everyone knows that everyone knows… and that metaknowledge enables actions that wouldn’t be possible if it was merely true that everyone knew and thought they were the only one, and those additional possible actions are extremely relevant to our current conversation” to another friend who thought “common knowledge” meant “knowledge that is common”, and had I not intervened the ensuing conversation would have been useless at best.

Costs of Legibility

The obvious ones are time and mental effort, and those should not be discounted. Given a finite amount of time, legibility on one work trades off against another piece being produced at all, and that may be the wrong call.

A second is that legibility can make things really dry. Legibility often means precision, and precision is boring, especially relative to work optimized to be emotionally activating. 

Beyond that, legibility is not always desirable. For example, unilateral legibility in an adversarial environment makes you vulnerable, as you’re giving people the keys to the kingdom of “effective lies to tell you”. 

Lastly, premature epistemic legibility kills butterfly ideas, which are beautiful and precious and need to be defended until they can evolve combat skills.

How to be Legible

This could easily be multiple posts; I’m including a how-to section here more to help convey the concept of epistemic legibility than to write a comprehensive guide to achieving it. The list is not complete, and items on it can be faked. I think a lot of legibility is downstream of something harder to describe. Nonetheless, here are a few ways to make yourself more legible, when that is your goal.

  • Make it clear what you actually believe.
    • Watch out for implicit quantitative estimates (“probably”, “a lot”, “not very much”) and make them explicit, even if you have a very wide confidence interval. The goals here are twofold: the first is to make your thought process explicit to you. The second is to avoid confusion – people can mean different things by “many”, and I’ve seen some very long arguments suddenly resolve when both sides gave actual numbers.
  • Make clear the evidence you are basing your beliefs on.
    • This need not mean “scientific fact” or “RCT”. It could be “I experienced this a bunch in my life” or “gut feeling” or “someone I really trust told me so”. Those are all valid reasons to believe things. You just need to label them.
  • Make that evidence easy to verify.
    • More accessible sources are better.
      • Try to avoid paywalls and $900 books with no digital versions.
      • If it’s a large work, use page numbers or timestamps pointing to the specific claim, removing the burden of reading an entire book to check your work (but if your claim rests on a large part of the work, better to say that than to artificially constrict your evidence).
    • One difficulty is when the evidence is in a pattern, and no one has rigorously collated the data that would let you demonstrate it. You can gather the data yourself, but if it takes a lot of time it may not be worth it. 
    • In times past, when I wanted to refer to a belief I had in a blog post but didn’t have a citation for it, I would google the belief and link to the first article that came up. I regret this. Just because an article agrees with me doesn’t mean it’s good, or that its reasoning is my reasoning. So one, I might be passing on a bad argument. Two, I know that, so if someone discredits the linked article it doesn’t necessarily change my mind, or even create in me a feeling of obligation to investigate. I now view it as more honest to say “I believe this but only vaguely remember the reasons why”, and if it ends up being a point of contention I can hash it out later.
  • Make clear the logical steps between the evidence and your final conclusion.
  • Use examples. Like, so many more examples than you think. Almost everything could benefit from more examples, especially if you make it clear when they’re skippable so people who have grokked the concept can move on.
    • It’s helpful to make clear when an example is evidence vs. when it’s a clarification of your beliefs. The difference is whether you’d change your mind if the point were proven false: if yes, it’s evidence. If you’d say “okay fine, but there are a million other cases where the principle holds”, it’s an example. One of the mistakes I made with early epistemic spot checks was putting too much emphasis on disproving examples that weren’t actually evidence.
  • Decide on an audience and tailor your vocabulary to them. 
    • All fields have words that mean something different in the field than in general conversation, like “work”, “airborne”, and “significant”. If you’re writing within the field, using those terms helps with legibility by conveying a specific idea very quickly. If you’re communicating outside the field, using such terms without definition hinders legibility, as laypeople misapply their general knowledge of the English language to your term of art and predictably get it wrong. You can help on the margins by defining the term in your text, but I consider some uses of this iffy.
      • The closer the technical definition of a term is to its common usage, the more likely this is to be a problem because it makes it much easier for the reader to think they understand your meaning when they don’t.
    • At first I wanted to yell at people who use terms of art in work aimed at the general population, but sometimes it’s unintentional, and sometimes it’s a domain expert who’s bad at public speaking and has been unexpectedly thrust onto a larger stage, and we could use more of the latter, so I don’t want to punish people too much here. But if you’re, say, a journalist who writes a general populace book but uses an academic term of art in a way that will predictably be misinterpreted, you have no such excuse and will go to legibility jail. 
    • A skill really good interviewers bring to the table is recognizing terms of art that are liable to confuse people and prompting domain experts to explain them.
  • Write things down, or at least write down your sources. I realize this is partially generational and Gen Z is more likely to find audio/video more accessible than written work, and accessibility is part of legibility. But if you’re relying on a large evidence base it’s very disruptive to include it in audio and very illegible to leave it out entirely, so write it down.
  • Follow all the rules of normal readability – grammar, paragraph breaks, no run-on sentences, etc.

A related but distinct skill is making your own thought process legible. John Wentworth describes that here.

Synthesis

“This isn’t very epistemically legible to me” is a valid description (when true), and a valid reason not to engage. It is not automatically a criticism.

“This idea is in its butterfly stage”, “I’m prioritizing other virtues” or “this wasn’t aimed at you” are all valid defenses against accusations of illegibility as a criticism (when true), but do not render the idea more legible.

“This call to action isn’t sufficiently epistemically legible to the people it’s aimed at” is an extremely valid criticism (when true), and we should be making it more often.

I apologize to Carol Dweck for 70% of the vigor of my criticism of her work; she deserves more credit than I gave her for making it so easy to do that. I still think she’s wrong, though.

Epilogue: Developing a Standard for Legibility

As mentioned above, I think the major value add from the concept of legibility is that it lets us talk about whether a given work is sufficiently legible for its goal. To do this, we need to have some common standards for how much legibility a given goal demands. My thoughts on this are much less developed and by definition common standards need to be developed by the community that holds them, not imposed by a random blogger, so I’ll save my ideas for a different post. 

Epilogue 2: Epistemic Cooperation

Epistemic legibility is part of a broader set of skills/traits I want to call epistemic cooperation. Unfortunately, legibility is the only one I have a really firm handle on right now (to the point that I originally conflated the concepts, until a few conversations highlighted the distinction – thanks, friends!). I think epistemic cooperation, in the sense of “makes it easy for us to work together to figure out the truth”, is a useful concept in its own right, and hope to write more about it as I get additional handles. In the meantime, there are a few things I want to highlight as increasing or signalling cooperation in general but not legibility in particular:

  • Highlight ways your evidence is weak, related things you don’t believe, etc.
  • Volunteer biases you might have.
  • Provide reasons people might disagree with you.
  • Don’t emotionally charge an argument beyond what’s inherent in the topic, but don’t suppress emotion below what’s inherent in the topic either.
  • Don’t tie up brain space with data that doesn’t matter.

Thanks to Ray Arnold, John Salvatier, John Wentworth, and Matthew Graves for discussion on this post. 

Butterfly Ideas

Or “How I got my hyperanalytical friends to chill out and vibe on ideas for 5 minutes before testing them to destruction”

Sometimes talking with my friends is like intellectual combat, which is great. I am glad I have such strong cognitive warriors on my side. But not all ideas are ready for intellectual combat. If I don’t get my friends on board with this, some of them will crush an idea before it gets a chance to develop, which feels awful and can kill off promising avenues of investigation. It’s like showing a beautiful, fragile butterfly to your friend to demonstrate the power of flight, only to have them grab it and crush it in their hands, then point to the mangled corpse as proof that butterflies not only don’t fly, but can’t fly – look how busted their wings are.

You know who you are

When I’m stuck in a conversation like that, it has been really helpful to explicitly label things as butterfly ideas. This has two purposes. First, it’s a shorthand for labeling what I want (nurturance and encouragement). Second, it explicitly labels the idea as not ready for prime time in ways that make it less threatening to my friends. They can support the exploration of my idea without worrying that support of exploration conveys agreement, or agreement conveys a commitment to act.

This is important because very few ideas start out ready for the rigors of combat. If they’re not given a sheltered period, they will die before they become useful. This cuts us off from a lot of goodness in the world. Examples:

  • A start-up I used to work for had a keyword that meant “I have a vague worried feeling I want to discuss without justifying”. This let people bring up concerns before they had an ironclad case for them, and made statements that could otherwise have felt like intense criticism feel more like information sharing (they’re not asserting this will definitely fail, they’re asserting they have a feeling that might lead to some questions). This in turn meant that problems got brought up and addressed earlier, including problems in the classes “this is definitely gonna fail and we need to make major changes” and “this is an excellent idea but Bob is missing the information that would help him understand why”.
    • This keyword was “FUD (fear, uncertainty, doubt)”. It is used in exactly the opposite way in cryptocurrency circles, where it means “you are trying to increase our anxiety with unfounded concerns, and that’s bad”. Words are tricky.
  • Power Buys You Distance From The Crime started out as a much less defensible seed of an idea with a much worse explanation. I know that had I talked about it in public it would have caused a bunch of unproductive yelling that made it harder to think, because I did and it did (but later, when it was ready, intellectual combat with John Wentworth improved the idea further).
  • The entire genre of “Here’s a cool new emotional tool I’m exploring”
  • The entire genre of “I’m having a feeling about a thing and I don’t know why yet”

I’ve been on the butterfly-crushing end of this myself – I’m thinking of a particular case last year where my friend brought up an idea that, if true, would require costly action on my part. I started arguing with the idea, and they snapped at me to stop ruining their dreams. I chilled out, and we had a long discussion about their goals, how they interpreted some evidence, why they thought a particular action might further said goals, etc.

A week later all of my objections to the specific idea were substantiated and we agreed not to do the thing- but thanks to the conversation we had in the meantime, I have a better understanding of them and what kinds of things would be appealing to them in the future. That was really valuable to me and I wouldn’t have learned all that if I’d crushed the butterfly in the beginning.

Notably, checking out that idea was fairly expensive, and only worth it because this was an extremely close friend (which both made the knowledge of them more valuable, and increased the payoff to helping them if they’d been right). If they had been any less close, I would have said “good luck with that” and gone about my day, and that would have been a perfectly virtuous reaction. 

I almost never discuss butterfly ideas on the public internet, or even in 1:many channels. Even when people don’t actively antagonize them, the environment of Facebook or even large group chats means that people often read with half their brain and respond to a simplified version of what I said. For a class of ideas that live and die by context and nuance and pre-verbal intuitions, this is crushing. So what I write in public ends up being on the very defensible end of the things I think. This is a little bit of a shame, because the returns to finding new friends to study your particular butterflies with are so high, but c’est la vie.

This can play out a few ways in practice. Sometimes someone will say “this is a butterfly idea” before they start talking. Sometimes when someone is being inappropriately aggressive towards an idea the other person will snap “will you please stop crushing my butterflies!” and the other will get it. Sometimes someone will overstep, read the other’s facial expression, and say “oh, that was a butterfly, wasn’t it?”. All of these are marked improvements over what came before, and have led to more productive discussions with less emotional pain on both sides.

A Quick Look At 20% Time

I was approached by a client to research the concept of 20% time for engineers, and they graciously agreed to let me share my results. Because this work was tailored to the needs of a specific client, it may have gaps or assumptions that make it a bad 101 post, but I am sharing it in the expectation that it is more useful than not publishing at all.

Side project time, popularized as 20% time at Google, is a policy that allows employees to spend a set percentage of their time on a project of their choice, rather than one directed by management. In practice this can mean a lot of different things, ranging from “spend 20% of your time on whatever you want” to “sure, spend all the free time you want generating more IP for us, as long as your main project is completely unaffected” (often referred to as 120% time) to “theoretically you’re free to do whatever, but we’ve imposed so many restrictions that this means nothing”. I did a 4-hour survey to get a sense of what implementations were available and how they felt for workers.

A frustration here is that almost all of what I could find via Google searches were puff-pieces, anti-puff-pieces, and employees complaining on social media (and one academic article). The single best article I found came not through a Google search, but because I played D&D with the author 15 years ago and she saw me talking about this on Facebook. She can’t be the only one writing about 20% time in a thoughtful way and I’m mad that that writing has been crowded out by work that is, at best, repetitive, and at worst actively misleading.

There are enough anecdotal reports that I believe 20% time exists and is used to good effect by some employees at some companies (including Google) some of the time. The dearth of easily findable information on specific implementations, managerial approaches, trade-offs, etc, makes me downgrade my estimate of how often that happens, vs 20% time being a legible signal of an underlying attitude towards autonomy, or a dubious recruitment tool. I see a real market gap for someone to explain how to do 20% time well at companies of different sizes and product types.

But in the meantime, here’s the summary I gave my client. Reminder: this was originally intended for a high-context conversation with someone who was paying me by the hour, and as such is choppier, less nuanced, and has different emphases than ideal for a public blog post.  

My full notes are available here.

  • To the extent it’s measured, utilization appears to be low, so the policy doesn’t cost very much (see the arithmetic sketch after this list).
    • In 2015, a Google HR exec estimated utilization at 10% (meaning it took 2% of all employees’ time). 
    • In 2009, 12 months after Atlassian introduced 20% time, recorded utilization was at 5% (meaning employees were measured to spend 1.1% of their time on it) and estimated actual utilization was <=15% (Notably, nobody complains that Atlassian 20% is fake, and I confirmed with a recently departed employee that it was still around as of 2020).
  • Interaction with management and evaluation is key. A good compromise is to let people spend up to N hours on a project, and require a check-in with management beyond that. 
    • Googlers consistently (although not universally) complained on social media that even when 20% time was officially approved, you’d be a fool to use it if you wanted a promotion or raises. 
    • However a manager at a less famous company indicated this hadn’t been a problem for them, and that people who approached perf the way everyone does at Google would be doomed anyway. So it looks like you can get out of this with culture.
    • An approval process is the kiss of death for a feeling of autonomy, but letting employees work on garbage for 6 months and then holding it against them at review time hurts too. 
    • Atlassian requires no approval to start, 3 uninvolved colleagues to vouch for a project to go beyond 5 days, and founder approval at 10 days. This seems to be working okay for them (but see the “costs” section below).
  • Costs of 20% time:
    • Time cost appears to be quite low (<5% of employee time, some of which couldn’t have been spent on core work anyway)
    • Morale effects can backfire: Sometimes devs make tools or projects that are genuinely useful, but not useful enough to justify expanding or sometimes even maintaining them. This leads to telling developers they must give up on a project they value and enjoyed (bad for their morale) or an abundance of tools that developers value but are too buggy to really rely on (bad for other people’s morale). This was specifically called out as a problem at Atlassian.
    • Employees on small teams are less likely to feel able to take 20% time, because they see the burden of core work shifting to their co-workers. But being on a small team already increases autonomy, so that may not matter.
  • Benefits of 20% time:
    • New products. This appears to work well for companies that make the kind of products software developers are naturally interested in, but not otherwise.
    • The gain in autonomy generally causes the improvements in morale and thus productivity that you’d expect (unless it backfires), but no one has quantified them.
    • Builds slack into the dev pipeline, such that emergencies can be handled without affecting customers.
    • Lets employees try out new teams before jumping ship entirely.
    • Builds cross-team connections that pay off in a number of ways, including testing new teams.
    • Gives developers a valve to overrule bug fixes and feature requests that their boss rejected from the official roadmap.
  • There are many things to do with 20% time besides new products.
    • Small internal tools, QOL improvements, etc (but see “costs”).
    • Learning, which can mean classes, playing with new tools, etc.
    • Decreasing technical debt.
    • Non-technical projects, e.g. charity drives.
  • Other notes:
    • One person suggested 20% time worked better at Google when it hired dramatically overqualified weirdos to work on mundane tech, and as they started hiring people more suited to the task with less burning desire to be working on something else, utilization and results decreased. 
    • 20% or even 120% time has outsized returns for industries that have very high capital costs but minimal marginal costs, such that employees couldn’t do them at home. This was a big deal at 3M (a chemical company) and, for the right kind of nerd, big data.
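As promised in the utilization bullet above, a tiny sketch of the conversion between “utilization” (the fraction of the 20% allowance actually used) and the share of all employee time, assuming the allowance is exactly 20%:

```python
# "Utilization" = fraction of the 20% allowance actually used.
ALLOWANCE = 0.20  # 20% time

def share_of_all_time(utilization: float) -> float:
    """Convert utilization of the allowance into a share of total employee time."""
    return ALLOWANCE * utilization

# Google, 2015: an HR exec's 10% utilization estimate implies
# 0.20 * 0.10 = 2% of all employees' time, matching the figure above.
print(f"{share_of_all_time(0.10):.1%}")
```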

Thanks to the anonymous client for commissioning this research and allowing me to share it, and my Patreon patrons for funding my writing it up for public consumption.