Polysemanticity is the idea that a single neuron or feature takes on multiple meanings or roles depending on context, and it is widespread in LLMs. Essentially, an individual neuron or attention head can represent several unrelated ideas, patterns, or tasks rather than carrying one fixed meaning.

For example, OpenAI's post "Multimodal Neurons in Artificial Neural Networks" shows neurons in their CLIP model that capture a specific idea, e.g., a neuron that lights up for photos of spiders, images of Spider-Man, and the text "spider". As models grow larger, neurons that cleanly represent a single concept become increasingly rare.
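
To make this concrete, here is a minimal sketch (not from the OpenAI post) of how one might probe a single MLP neuron in a small language model across unrelated prompts: if the same neuron activates strongly for several of them, it is behaving polysemantically. The model (`gpt2`), the layer index, and the neuron index are arbitrary illustrative choices.

```python
# Sketch: record one MLP neuron's activation on unrelated prompts.
# Layer and neuron indices below are arbitrary, for illustration only.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

LAYER, NEURON = 5, 1234
activations = {}

def hook(module, inputs, output):
    # output has shape (batch, seq_len, hidden); keep the neuron's max
    # activation over the sequence for this prompt.
    activations["value"] = output[0, :, NEURON].max().item()

# Hooking the first MLP projection (mlp.c_fc) gives per-neuron
# pre-activation values, which is enough for this illustration.
handle = model.h[LAYER].mlp.c_fc.register_forward_hook(hook)

prompts = [
    "A spider spun its web in the corner.",
    "The stock market rallied sharply today.",
    "She practiced the violin for three hours.",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)
    print(f"{activations['value']:8.3f}  {prompt}")

handle.remove()
```

If the printed activations are high for prompts about clearly unrelated topics, that single neuron is being shared across concepts rather than encoding one idea.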

Polysemanticity occurs for two main reasons:

  1. Efficiency: Models need to fit far more concepts than they have neurons or parameters, so the same units are reused across different features and contexts; a toy sketch of this compression appears after this list.
  2. Context-Dependent Interpretation: Polysemantic units adapt to the input and surrounding context, so the same unit can contribute to different behaviours depending on the prompt.
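
The efficiency point is essentially a counting argument: when there are more features to represent than dimensions to hold them, directions must be shared. The toy numpy sketch below (an illustration with random feature directions, not a trained model) packs 20 features into 5 dimensions and shows that every dimension inevitably carries weight for several features.

```python
# Toy sketch: more features than dimensions forces each dimension
# to respond to multiple features. Numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 20, 5  # more concepts than available dimensions

# Assign each feature a unit-length direction in the small hidden space.
# With 20 directions in 5 dimensions, they cannot all be orthogonal.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Activate one feature at a time and see which hidden dimensions respond.
for feature in range(3):
    x = np.zeros(n_features)
    x[feature] = 1.0
    hidden = x @ W  # shape (n_dims,)
    responding = np.where(np.abs(hidden) > 0.3)[0]
    print(f"feature {feature} -> hidden dims {responding.tolist()}")

# Conversely, a single hidden dimension carries weight for many features.
dim = 0
strong = np.where(np.abs(W[:, dim]) > 0.3)[0]
print(f"hidden dim {dim} responds to features {strong.tolist()}")
```

Running this shows each feature spreading across several dimensions and each dimension responding to several features, which is the reuse that makes individual units hard to read in isolation.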

Polysemanticity adds flexibility, but it also makes interpretability and explainability significantly harder: what do specific parts of the model "mean"? Understanding polysemanticity is therefore essential for mechanistic interpretability.