This year I’ve been on sabbatical, spending my time upskilling in AI Safety. Part of that has been doing independent research projects across different fields.
Some of those projects have resulted in useful output, notably A Toy Model of the U-AND Problem, Do No Harm?, and SAEs and their Variants.
Others I’ve simply failed fast and moved on from.
I’ve written up the projects that still have something to say, even if it’s mostly negative results.
Find them on LessWrong.