3 Comments
Terry underwood's avatar

I’m not capable of working under the hood of a transformer, and I’m woefully inadequate because I’m not digital myself, but this idea of feature steering seems like a big deal. It means people can change how AI writes, reasons, elaborates, emotes. Humans can change one another’s minds by persuasion, but we can’t feature steer.

Claude, with a training cutoff of April 2024, says this:

“From research and observed patterns, here are the main types of behaviors you can steer:

1. Writing Style

- Formality level (casual to academic)

- Conciseness vs verbosity

- Simplicity vs complexity of language

- Tone (friendly, professional, technical)

2. Reasoning Patterns

- Step-by-step vs holistic explanations

- Depth of analysis (surface vs detailed)

- Degree of uncertainty expression

- Level of mathematical rigor

3. Domain Expertise

- Technical vocabulary density

- Field-specific conventions

- Citation frequency

- Jargon usage

4. Interaction Style

- Question frequency

- Empathy level

- Directiveness vs suggestiveness

- Tutorial vs peer discussion style

5. Output Structure

- List vs narrative format

- Use of examples/analogies

- Code vs prose ratio

- Visual/diagram suggestions

What's interesting is that these aren't binary switches - they're more like continuous spectrums you can adjust. Is there a particular spectrum here that interests you most?”

I assume Claude is simplified for me but on track. Am I on track? What does feature steering mean in practical terms for, say, a high school student?

Philosophically, does the human capacity to turn the dials and mess with artificial brains mean humans really are the boss of AI? Could HAL be steered during his worst moments? Is FS really our fail-safe?

Harry Law's avatar

It depends on how useful feature steering turns out to be in practice, but today it’s less powerful than other common alignment techniques like reinforcement learning from human feedback. I’m not sure if it has legs given the impact on capabilities, but we shall see!

Terry underwood's avatar

Thanks, Harry. Appreciate your work.
