What is the o1 model? A new series?
💡 The three-sentence introduction that matters most:
- o1 is a new kind of reasoning system developed by OpenAI. This new route is no longer bottlenecked by pre-training; instead, it expands inference-time computation to handle more challenging reasoning scenarios.
- o1 was built by training a new model on long reasoning chains with large-scale reinforcement learning (RL), teaching it to use its chain of thought to think efficiently in a highly data-efficient way and enabling large-scale deployment.
- o1 searches at inference time on the user's behalf and invests more compute in reasoning, resulting in higher inference costs. It points to the future direction of AI development.
💡 Why an all-new series called o1?
As mentioned above, it represents the future direction OpenAI is betting on: the “o” in “o1” stands for OpenAI, and the counter is reset to 1. With this model, OpenAI hopes to redefine the reasoning ability of artificial intelligence and usher in a new era.
From a technical perspective, large models previously iterated mainly through pre-training, which relies on enormous amounts of data. Today, natural data is no longer sufficient, and it is hard to see any company producing an order of magnitude more of it. The research community has long discussed post-training as a way to improve intelligence, and that is why it is called o1: the paradigm has changed, and reinforcement learning has been introduced to keep pushing intelligence higher.
💡 Some background worth knowing
- o1 is the previously rumored next-generation model, codenamed Strawberry and Q*. The versions released so far are o1-preview and o1-mini, while many comparisons in the official posts use the full o1. There are various speculations about why the full version has not shipped yet; likely reasons include inference cost, deployment pressure, and safety management.
- o1-preview is a scaled-down version of o1, released to identify its most suitable use cases and areas for improvement.
- o1-mini is a more cost-effective version optimized for STEM applications. It achieves nearly the same performance as o1 on mathematical and programming tasks at a significantly lower cost.
- At present, o1's call rate and frequency are limited to only a handful (five to ten) of calls per week, and the price is very high: output tokens cost $60 per 1M, four times the price of GPT-4o (see OpenAI's official pricing page).
- In addition, o1's chain of thought is hidden from users, but it is still counted in output token usage, so actual usage can exceed what the visible answer suggests; a rough cost estimate is sketched below.
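As a rough illustration of what this means for cost, here is a minimal sketch using the article's figure of $60 per 1M output tokens (and the implied $15 per 1M for GPT-4o). The token counts are hypothetical; the only assumption is the one stated above, that hidden reasoning tokens are billed as output.

```python
# Rough cost estimate for an o1 call, assuming the article's pricing
# ($60 per 1M output tokens, ~4x GPT-4o) and hypothetical token counts.
# Hidden reasoning tokens are billed as output even though they are not shown.

O1_OUTPUT_PRICE_PER_M = 60.0     # USD per 1M output tokens (from the article)
GPT4O_OUTPUT_PRICE_PER_M = 15.0  # USD per 1M output tokens (o1 is ~4x this)

def output_cost(visible_tokens: int, reasoning_tokens: int, price_per_m: float) -> float:
    """Cost of one response; reasoning tokens count toward output billing."""
    billed = visible_tokens + reasoning_tokens
    return billed / 1_000_000 * price_per_m

# Hypothetical example: a 500-token visible answer backed by 4,000 hidden reasoning tokens.
o1_cost = output_cost(500, 4_000, O1_OUTPUT_PRICE_PER_M)
gpt4o_cost = output_cost(500, 0, GPT4O_OUTPUT_PRICE_PER_M)  # no hidden reasoning tokens
print(f"o1: ${o1_cost:.4f}  vs  GPT-4o: ${gpt4o_cost:.4f}")
```

Even with identical visible answers, the hidden reasoning tokens dominate the bill, which is why the per-response cost gap is much larger than the headline 4x price ratio suggests.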
💡 Some key data points
✳️ Human evaluators rate GPT-4o higher on writing and editing tasks, while o1 is rated higher on programming, data analysis, mathematical computation, and other logic-heavy tasks.
✳️ While most LLMs invest the bulk of their compute in pre-training, o1 allocates a larger share to inference.
✳️ Compared with other state-of-the-art models, o1 represents a significant advance in mathematical reasoning!
What scenarios is the o1 model suited to?
First, it is clearly not suited to simple queries: as the figure below shows, it consumes far more tokens and time than other LLMs, which is unnecessary and wasteful.
From official results and various third-party tests, o1 shows its potential on complex cognitive tasks, especially in high-value areas that demand deep reasoning and analysis. We can therefore look forward to its development in scientific research, software programming, mathematical problems, and so on; at the same time, there is reason to doubt that reward models can cover every field of general knowledge.
Moreover, given its current latency, applications can only use it in more asynchronous scenarios, which means it will be hard to reach the mass consumer market in the short term. It remains better suited to professional producers, who demand higher quality and are more tolerant of delay.
Currently, o1 only thinks for a few seconds before responding. Describing the future, OpenAI's vision is for models to think about an answer for hours, days, or even weeks. Most scenarios do not need such high inference costs, but for complex applications such as academic work, medical research, and biological research, it is highly anticipated.
What is o1 doing behind the scenes?
The key points:
✳️ o1 is trained with reinforcement learning, which teaches the model how to use its chain of thought to think efficiently; in this way, its performance keeps improving with more reinforcement learning and more thinking time.
o1 is believed to use a process reward model that scores each step in the reasoning process. This differs from traditional reinforcement learning, which typically provides a single reward at the end of the whole process; process rewards give finer-grained feedback that helps the model improve at every inference step.
In addition to process rewards, it is speculated that o1 may also incorporate outcome-based reward models or heuristics to judge whether the model has produced the correct answer. This may include a penalty on answer length to prevent the model from generating endless non-answers. A speculative sketch contrasting the two reward types follows below.
It is worth noting that reward models in different scenarios may need to be customized and optimized for specific tasks and goals, so o1 will not necessarily produce usable results on many vertical, logic-heavy problems.
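To make the distinction concrete, here is a minimal sketch, not OpenAI's implementation: the step representation, the scoring functions, and the length-penalty value are placeholders. It contrasts a single outcome reward with per-step process rewards, including the speculated penalty on answer length.

```python
from typing import Callable, List

Step = str  # one intermediate reasoning step, represented as plain text here

def outcome_reward(steps: List[Step], final_answer: str, is_correct: Callable[[str], bool],
                   length_penalty: float = 0.001) -> float:
    """Single reward at the end: 1 if the answer checks out, minus a small
    per-step penalty to discourage endless non-answers (speculated in the article)."""
    base = 1.0 if is_correct(final_answer) else 0.0
    return base - length_penalty * len(steps)

def process_rewards(steps: List[Step], score_step: Callable[[Step], float]) -> List[float]:
    """Finer-grained feedback: one score per reasoning step, as a process
    reward model (PRM) would provide, instead of one score at the very end."""
    return [score_step(s) for s in steps]

# Hypothetical usage with toy scoring functions:
steps = ["restate the problem", "set up the equation", "solve for x"]
print(outcome_reward(steps, "x = 3", is_correct=lambda a: a == "x = 3"))
print(process_rewards(steps, score_step=lambda s: 0.8 if "equation" in s else 0.5))
```

The design difference is that the outcome reward only tells the policy whether the whole trajectory worked, while process rewards assign credit to individual steps, which is exactly the finer-grained feedback described above.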
✳️ During training, o1 introduces dynamic Reasoning Tokens that elicit an implicit chain of thought, giving the model longer thinking time and stronger reasoning ability.
This training approach also involves “post-training scaling laws”: performance can be improved by increasing the exploration time of reinforcement-learning training and by giving the model more inference and thinking time afterwards, as the sketch below illustrates.
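One simple way to trade extra inference-time compute for quality, and a plausible though unconfirmed ingredient of this kind of scaling, is best-of-N sampling: generate several candidate reasoning chains and keep the one a reward model scores highest. The generator and scorer below are toy placeholders, not o1's actual mechanism.

```python
import random
from typing import Callable, Tuple

def best_of_n(generate: Callable[[], str], score: Callable[[str], float], n: int) -> Tuple[str, float]:
    """Spend more test-time compute by sampling n candidate reasoning chains
    and returning the one the reward model rates highest."""
    candidates = [generate() for _ in range(n)]
    scored = [(c, score(c)) for c in candidates]
    return max(scored, key=lambda cs: cs[1])

# Toy stand-ins: a random "reasoning chain" generator and a noisy scorer.
random.seed(0)
generate = lambda: f"chain-{random.randint(0, 99)}"
score = lambda chain: random.random()

for n in (1, 4, 16):  # more samples ~ more thinking time ~ higher expected best score
    best, s = best_of_n(generate, score, n)
    print(f"n={n:2d}  best={best}  score={s:.2f}")
```

The point of the sketch is only that the expected quality of the selected chain rises with N, i.e., with the inference budget, which is the behavior “post-training scaling laws” describe.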
Speculation on the Impact of o1 Release
OpenAI has indicated that the next step toward AGI is no longer bigger pre-training but reinforcement-based post-training, and it has chosen reinforcement learning as the starting point in this direction.
So can we say that future competition among large models will no longer be about accumulating general knowledge from pre-training data, but about iterating on models' problem-solving approaches to achieve stronger reasoning?
Reinforcement learning, as one of the key technologies of this new paradigm, relies on carefully designed components such as reward models and search-and-exploration mechanisms so that the model can compute and analyze effectively on long-sequence reasoning problems. This approach will soon spread across the large-model competition. If that is the direction, reinforcement-learning designs tailored to vertical domains will most likely produce more vertical models, and I think a single reward model that fits all reasoning scenarios is unlikely.
The future official version of o1 will hopefully expose more parameters for users to adjust the inference budget and search settings; otherwise the hidden inference consumption would be too much to handle. Only then can cost and quality be balanced across different scenarios. A purely hypothetical sketch of such controls follows.
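To illustrate what such user-facing controls could look like, here is a purely hypothetical sketch: `max_reasoning_tokens`, `search_width`, and the cost formula are illustrative placeholders, not real API parameters.

```python
from dataclasses import dataclass

@dataclass
class InferenceBudget:
    """Hypothetical knobs a future o1 API might expose (not a real API)."""
    max_reasoning_tokens: int = 8_000   # cap on hidden chain-of-thought length
    search_width: int = 4               # how many candidate chains to explore
    price_per_m_output: float = 60.0    # USD per 1M output tokens (from the article)

    def worst_case_cost(self) -> float:
        """Upper bound on output-side cost if the full budget is consumed."""
        return self.max_reasoning_tokens * self.search_width / 1_000_000 * self.price_per_m_output

cheap = InferenceBudget(max_reasoning_tokens=2_000, search_width=1)
thorough = InferenceBudget(max_reasoning_tokens=16_000, search_width=8)
print(f"cheap: ${cheap.worst_case_cost():.2f}  thorough: ${thorough.worst_case_cost():.2f}")
```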
References
https://openai.com/index/learning-to-reason-with-llms
https://cdn.openai.com/o1-system-card.pdf
https://www.interconnects.ai/p/reverse-engineering-openai-o1
https://www.latent.space/p/openai-api-and-o1?r=q2z20
https://mp.weixin.qq.com/s/ZYIHoSUoTH4wd3d5Z2zmeQ