OpenAI has developed InstructGPT, a language model that follows user instructions better than GPT-3 while being more truthful and less toxic. The model is trained with reinforcement learning from human feedback (RLHF) to align it with user intentions: human labelers write demonstrations of the desired behavior and rank candidate model outputs, and both signals are used to fine-tune GPT-3. The gains hold even against much larger models; labelers prefer a small InstructGPT model to a far bigger GPT-3. This approach helps unlock capabilities GPT-3 already had but that were difficult to elicit through prompt engineering alone.

InstructGPT models are now the default on OpenAI’s API, marking the first application of OpenAI’s alignment research to a product. While significant progress has been made, the models still have limitations and potential for misuse, which OpenAI continues to address through ongoing research and safety measures.
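To make the ranking step concrete, below is a minimal sketch of the pairwise preference loss commonly used to train a reward model from labeler rankings. The announcement does not specify tooling, so PyTorch, the MLP reward head, and all names here are illustrative assumptions, not OpenAI’s actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for a reward head. In practice the backbone would be a
    pretrained transformer; a small MLP over fixed-size embeddings keeps
    this sketch self-contained."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise ranking loss: the labeler-preferred response should score
    # higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# One illustrative optimization step on random stand-in embeddings.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

chosen = torch.randn(8, 128)    # embeddings of preferred responses
rejected = torch.randn(8, 128)  # embeddings of dispreferred responses

optimizer.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
```

The learned reward then serves as the objective for a reinforcement-learning fine-tuning stage (PPO in the InstructGPT paper); that second stage is omitted here.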
InstructGPT AI Features
- Improved instruction following: InstructGPT models demonstrate superior performance in following user instructions compared to GPT-3.
- Enhanced truthfulness: The models produce fewer imitative falsehoods and are less prone to making up facts (“hallucinating”).
- Reduced toxicity: InstructGPT generates fewer toxic outputs than GPT-3, as measured on the RealToxicityPrompts benchmark.
- Efficiency: Labelers prefer outputs from a 1.3B-parameter InstructGPT model over those from the 175B-parameter GPT-3, despite the former having over 100 times fewer parameters.
- Maintained capabilities: InstructGPT preserves GPT-3’s performance on academic NLP evaluations while improving in the targeted areas above.
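Since instruction-following models became the API default, a prompt like the one featured in OpenAI’s announcement can be sent directly. The sketch below assumes the legacy (pre-1.0) openai Python client and uses text-davinci-001 as a stand-in model name; substitute whichever instruction-following model your account exposes.

```python
# Requires `pip install "openai<1.0"` and an OPENAI_API_KEY environment variable.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    model="text-davinci-001",  # assumed InstructGPT-class model name
    prompt="Explain the moon landing to a 6 year old in a few sentences.",
    max_tokens=100,
    temperature=0.7,
)
print(response["choices"][0]["text"].strip())
```

Because instruction following is built into the model, no few-shot examples or elaborate prompt engineering are needed for such a request to be honored.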