I’m not capable of working under the hood of a transformer, and I’m woefully inadequate because I’m not digital myself, but this idea of feature steering seems like a big deal. It means people can change how AI writes, reasons, elaborates, emotes. Humans can change one another’s minds by persuasion, but we can’t feature steer.
Claude, with a training cutoff of April 2024, says this:
“From research and observed patterns, here are the main types of behaviors you can steer:
1. Writing Style
- Formality level (casual to academic)
- Conciseness vs verbosity
- Simplicity vs complexity of language
- Tone (friendly, professional, technical)
2. Reasoning Patterns
- Step-by-step vs holistic explanations
- Depth of analysis (surface vs detailed)
- Degree of uncertainty expression
- Level of mathematical rigor
3. Domain Expertise
- Technical vocabulary density
- Field-specific conventions
- Citation frequency
- Jargon usage
4. Interaction Style
- Question frequency
- Empathy level
- Directiveness vs suggestiveness
- Tutorial vs peer discussion style
5. Output Structure
- List vs narrative format
- Use of examples/analogies
- Code vs prose ratio
- Visual/diagram suggestions
What's interesting is that these aren't binary switches - they're more like continuous spectrums you can adjust. Is there a particular spectrum here that interests you most?”
I assume Claude's answer is simplified for me but on track. Am I on track? What does feature steering mean in practical terms for, say, a high school student?
Philosophically, does the human capacity to turn the dials and mess with artificial brains mean humans really are the boss of AI? Could HAL be steered during his worst moments? Is feature steering really our fail-safe?
It depends on how useful feature steering turns out to be in practice, but today it’s less powerful than other common alignment techniques like reinforcement learning from human feedback. I’m not sure if it has legs given the impact on capabilities, but we shall see!
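For readers who want a concrete picture: one common way feature steering is implemented is "activation addition," where a direction in the model's hidden-state space is identified with a feature (say, formality) and a scaled copy of it is added during the forward pass. Here is a toy sketch of that idea; the vectors, dimensions, and the `steer` helper are all invented for illustration and are not any real model's internals:

```python
import numpy as np

def steer(hidden_state: np.ndarray,
          feature_direction: np.ndarray,
          strength: float) -> np.ndarray:
    """Nudge a hidden state along a feature direction.

    strength > 0 amplifies the feature (e.g. more formal),
    strength < 0 suppresses it, and 0 leaves the state unchanged.
    """
    # Normalize so `strength` has a consistent meaning regardless
    # of the direction vector's original length.
    unit = feature_direction / np.linalg.norm(feature_direction)
    return hidden_state + strength * unit

# Toy 4-dimensional "hidden state" and a made-up "formality" direction.
state = np.array([0.2, -1.0, 0.5, 0.3])
formality = np.array([1.0, 0.0, 1.0, 0.0])

more_formal = steer(state, formality, strength=2.0)
less_formal = steer(state, formality, strength=-2.0)
```

The point of the sketch is that the dial is continuous, matching Claude's observation above: the same direction, scaled positively or negatively, pushes the model's internal state toward or away from a feature rather than flipping a switch.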
Thanks, Harry. Appreciate your work.