Cursor Team Reveals: Online Reinforcement Learning Is Just the "Cherry on Top"
Selected Quotes
The paradox of online RL is that we can't use this to really create the model from scratch because users need to be using the model. It has to be good already.
It's kind of like cherry on top to really get this super delightful experience. Hopefully one day it will be like big big cherry.
Offline RL is more like DPO kind of technique. The sort of REINFORCE kind of RL is online.
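
To make the offline/online contrast in the last quote concrete, here is a minimal sketch of the two per-example objectives, assuming the standard textbook forms of the DPO loss and the REINFORCE surrogate. The function names, argument names, and the `beta` default are illustrative assumptions, not anything described by the Cursor team.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one logged preference pair.

    Offline: both responses come from a fixed dataset, so no new
    sampling from the current policy is needed.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

def reinforce_loss(logp_sampled, reward, baseline=0.0):
    """REINFORCE surrogate loss for one sampled response.

    Online: the response must be sampled from the current policy,
    and the reward is observed for that fresh sample.
    """
    return -(reward - baseline) * logp_sampled
```

The practical difference is in where the data comes from: `dpo_loss` can be computed from stored preference pairs, while `reinforce_loss` needs a response freshly generated by the model being trained, which is why the online variant requires a model that is already good enough for users to use.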