Understanding LLMs through a ranking perspective is insightful. Following pre-training, most performance enhancements concentrate on reranking by fine-tuning the proposal distribution to better align with task-specific requirements. Task-specific ranking criteria can be incorporated either as loss functions during training or as tokens during inference. Over time, optimizing both training and inference phases, along with iterative refinements, is essential for continued performance gains.

Sampling Next Token

LLMs operate by predicting the next token in a sequence. Specifically, 

  1. The model outputs a conditional categorical distribution P(T_{n+1} ∣ T_{≤n}) over the vocabulary 𝒱 for the next token T_{n+1}, conditioned on the preceding tokens T_{≤n}, i.e.,
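(One standard parameterization — assuming the model's final layer produces a logit vector $z_{n+1} \in \mathbb{R}^{|\mathcal{V}|}$ for position n+1 — is the softmax:)

$$P(T_{n+1} = v \mid T_{\le n}) \;=\; \mathrm{softmax}(z_{n+1})_v \;=\; \frac{\exp(z_{n+1,\,v})}{\sum_{v' \in \mathcal{V}} \exp(z_{n+1,\,v'})}, \qquad v \in \mathcal{V}.$$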

  2. The next token is then sampled from this conditional distribution. Typically, the following sampling methods can be applied to generate the sequence (a minimal sketch is given after this list):

    1. Greedy decoding: Always selects the token with the highest probability until it reaches the end-of-sequence token. Note that it may not lead to the most probable sequence.

    2. Beam search: Maintains several sequences with high probabilities simultaneously, selecting the most probable overall. This results in sentences that are often more grammatically correct and semantically relevant.

    3. Top-k sampling: Restricts the model to selecting from the top k tokens with the highest probabilities.

    4. Top-p/nucleus sampling: Considers the smallest set of tokens whose cumulative probability is greater than or equal to a predefined threshold p.
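To make the decoding strategies above concrete, here is a minimal sketch of greedy, top-k, and top-p selection from a single next-token distribution. It uses only NumPy; the probability vector `probs` is an assumed input and does not correspond to any specific LLM API.

```python
import numpy as np

def greedy(probs):
    """Pick the single highest-probability token."""
    return int(np.argmax(probs))

def top_k_sample(probs, k=50, rng=np.random.default_rng()):
    """Sample among the k most probable tokens (renormalized)."""
    top = np.argsort(probs)[::-1][:k]          # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()          # renormalize over the kept tokens
    return int(rng.choice(top, p=p))

def top_p_sample(probs, p=0.9, rng=np.random.default_rng()):
    """Sample from the smallest set of tokens whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]            # tokens sorted by descending probability
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # keep just enough tokens to reach p
    keep = order[:cutoff]
    q = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=q))

# Toy next-token distribution over a 5-token vocabulary
probs = np.array([0.05, 0.40, 0.30, 0.20, 0.05])
print(greedy(probs), top_k_sample(probs, k=3), top_p_sample(probs, p=0.9))
```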

Fig. 1 Next Token Generation


Misalignment with Task Requirements

Fig. 1 depicts how a sequence is generated by applying greedy decoding. The resulting sequence is one path out of |𝒱|^𝐿 possible paths, where 𝐿 is the maximum number of generated tokens. This sequence may or may not be the ideal response (the magenta sequence) for fulfilling task requirements, as the sampling heuristics are not specifically designed around those requirements. In other words, if we view the likelihood or probability of a sequence as a ranking score, current sampling methods consistently prioritize sequences with higher probabilities, which may not align with the task-specific scores that define the desired output.

Remarks:

  1. Generally, the search space |𝒱|^𝐿 of all possible generations is incredibly large; for instance, the vocabulary size and the maximum context window are both 128k for the Llama 3.1 models. Hence, it is nontrivial to rank every candidate generation for a given prompt.

  2. Long-tailed behavior in LLMs means they struggle with rare examples while performing well on common ones, as their training is based on data frequency rather than task-specific requirements.


“Rerank” for Better Alignment

To address the misalignment between the task requirements and the learned proposal distribution over the next token or the generated sequence, many works based on reinforcement learning from human feedback (RLHF) [ref. 1, 2, 3] adjust the sequence ranking by incorporating task-specific scores, i.e., rewards, into the LLM training loss. In addition, inference-time optimization methods [ref. 4, 5, 6] introduce task-specific requirements at inference time by providing problem-solving rationales generated by the model itself or by a stronger model. For continuous self-correction/self-improvement, [ref. 7, 8, 9] generate synthetic data using inference-time techniques and then train on that data.

Fig. 2 Proposal Distribution Alignment

Fig. 2 illustrates how the proposal distribution can be aligned with the task requirements at both training and inference time by introducing task-specific requirements/criteria.


Training Time

Supervised Fine Tuning

The pre-training step gives the model an initial sequence proposal distribution that is close to the distribution of the pre-training dataset. Supervised fine-tuning (SFT) then improves sequence ranking by training only on positive, task-specific examples: the cross-entropy loss prioritizes and reinforces these positive examples, so the proposal distribution becomes more aligned with the task requirements (a minimal sketch of the SFT loss is given below). However, because negative examples are never penalized, SFT only partially aligns the model with the task. To tackle this, [ref. 10] introduced an unlikelihood loss to account for negative examples during training.
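A minimal sketch of the token-level SFT objective in PyTorch, assuming a causal LM that returns next-token logits; the tensor names and masking convention here are illustrative, not tied to any particular framework's API.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, prompt_mask):
    """
    Token-level cross-entropy on positive (target) tokens only.

    logits:      [batch, seq_len, vocab]  next-token logits from the model
    target_ids:  [batch, seq_len]         full token sequence (prompt + response)
    prompt_mask: [batch, seq_len]         1 for prompt tokens (ignored), 0 for response tokens
    """
    # Predict token t+1 from positions <= t: shift logits left, labels right.
    shift_logits = logits[:, :-1, :]
    shift_labels = target_ids[:, 1:].clone()
    shift_mask = prompt_mask[:, 1:]

    # Only the response tokens (the "positive examples") contribute to the loss.
    shift_labels[shift_mask.bool()] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```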

RLHF

The classic RLHF objective, optimized with PPO [ref. 11], is as follows:
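(For reference, the objective from [ref. 1] — reward maximization plus a KL penalty toward the SFT policy and an optional pretraining term — is commonly written as:)

$$\max_{\phi}\;\; \mathbb{E}_{x \sim D,\; y \sim \pi_{\phi}(\cdot\,|\,x)}\big[\, r_{\theta}(x, y) \,\big] \;-\; \beta\, \mathbb{E}_{x,\,y}\!\left[ \log \frac{\pi_{\phi}(y\,|\,x)}{\pi_{\mathrm{SFT}}(y\,|\,x)} \right] \;+\; \gamma\, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}\big[\, \log \pi_{\phi}(x) \,\big]$$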

The reward model is learned by minimizing the following loss function,
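(as in [ref. 1], with σ the logistic sigmoid, y_w the preferred response, and y_l the dispreferred one:)

$$\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\Big[ \log \sigma\big( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \big) \Big]$$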

where positive examples and negative examples are compared in pairs. In this manner, task-specific requirements are embedded through the reward value derived from pairwise ranking, leading to a more aligned proposal distribution. Since only a small subset of sequences is affected by reranking, the second and third terms in the loss function help maintain stability and avoid disrupting other sequences.

Similarly, the loss functions of DPO [ref. 2] and KTO [ref. 12] embed the ranking preference and adjust the ranking in the original proposal distribution:

  1. DPO: the pairwise ranking order and the difference in ranking before and after reranking are both considered (see the objective after this list).

  2. KTO: only a pointwise ranking score is considered.
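For concreteness, the DPO objective from [ref. 2] makes this pairwise reranking explicit; π_ref denotes the frozen reference (e.g., SFT) policy:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_{\theta}(y_w\,|\,x)}{\pi_{\mathrm{ref}}(y_w\,|\,x)} \;-\; \beta \log \frac{\pi_{\theta}(y_l\,|\,x)}{\pi_{\mathrm{ref}}(y_l\,|\,x)} \right) \right]$$

The inner difference is exactly the change in (log-)ranking of y_w versus y_l relative to the reference distribution, i.e., the "before vs. after reranking" comparison mentioned above.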

Recently, group relative policy optimization (GRPO) [ref. 19] has gained tremendous attention due to the rising popularity of the DeepSeek models [ref. 18, 20]. Its advantage calculation effectively performs the ranking process in a group-based manner: each response is scored relative to the other responses sampled for the same prompt.

Inference Time

Recent works [ref. 6, 13] show that scaling up test-time/inference-time computation can be preferable to scaling up training for certain problems, especially math problems. The common practice is to intervene at the token level, the sequence level, or both to obtain better responses.

  1. Token Level: 

    1. Input tokens: by augmenting the input prompt with additional tokens, task-specific criteria are either provided before generation or elicited during generation, making the proposal distribution more aligned with task-specific requirements (a minimal prompt-augmentation sketch follows Fig. 3). 

    2. Output tokens: by enforcing “think” tokens in the generation, the problem is typically decomposed into more manageable subproblems, where each subproblem (or its pattern) may already be covered by the training data. The solution to each subproblem can then serve as “input context” for producing the final answer. 

Fig. 3 Input Level Modification [ref. 14]
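A minimal sketch of input-token augmentation in the zero-shot chain-of-thought style of [ref. 14]; `generate` is a hypothetical text-completion helper standing in for whatever LLM API is used, not a real library call.

```python
def generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with your model/API of choice."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    # Stage 1: elicit a reasoning trace by appending a trigger phrase to the input tokens.
    rationale = generate(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: condition on the rationale to extract the final answer.
    answer = generate(
        f"Q: {question}\nA: Let's think step by step. {rationale}\n"
        "Therefore, the answer is"
    )
    return answer.strip()
```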

  2. Sequence Level: 

    1. By sampling, modifying, and aggregating multiple candidate outputs, the proposal distribution is explored more effectively, and reranking is then performed within a smaller candidate list (a best-of-N sketch follows Fig. 4).

Fig. 4 Output Level Modification [ref. 6]
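A minimal best-of-N sketch of sequence-level reranking with a verifier, in the spirit of [ref. 6, 13]; `generate` and `verifier_score` are hypothetical helpers (a sampled LLM call and a task-specific verifier or reward model), not a specific API.

```python
def best_of_n(prompt: str, generate, verifier_score, n: int = 8) -> str:
    """Sample n candidate responses, then rerank them with a task-specific verifier."""
    candidates = [generate(prompt, temperature=0.8) for _ in range(n)]   # explore the proposal distribution
    scores = [verifier_score(prompt, c) for c in candidates]             # task-specific ranking scores
    best = max(range(n), key=lambda i: scores[i])                        # rerank within the small candidate list
    return candidates[best]
```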

Remarks:

  1. The verifier is employed to determine which candidate best responds to the input prompt based on task-specific requirements, reflecting the task's ranking criteria. Verification and modification of candidate responses can be applied sequentially and alternately to further enhance response accuracy.

  2. The model can self-improve [ref. 7, 8] by training on positive examples it generated itself, selected for format or accuracy.


Data Collection

Ideally, if the training data covered all scenarios needed for a given task, the model could achieve the desired outcomes through SFT alone. However, the challenge often lies in the lack of available datasets for a given task. To address this problem, two approaches have been proposed:

  1. Data augmentation by human annotators: this is less scalable because it is time-consuming and relatively costly.

  2. Synthetic dataset augmentation/generation: [ref. 15, 16, 17] show that synthetic datasets can improve model performance. However, prompts, verifiers, and seed data need to be carefully designed to ensure the quality and diversity of the generated datasets, which is key to the resulting gains (a simplified Self-Instruct-style loop is sketched after Fig. 6). 

Fig. 5 Self-Instruct [ref. 15]

Fig. 6 AgentInstruct [ref. 16]
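A highly simplified sketch of a Self-Instruct-style loop [ref. 15]: bootstrap new instructions from seed tasks, filter near-duplicates, and grow the pool. `generate` and `similarity` are hypothetical helpers; real pipelines also add answer generation, format/quality filters, and verifiers.

```python
import random

def self_instruct(seed_tasks, generate, similarity, rounds=100, sim_threshold=0.7):
    """Grow an instruction pool by prompting the model with sampled in-pool examples."""
    pool = list(seed_tasks)
    for _ in range(rounds):
        demos = random.sample(pool, k=min(8, len(pool)))  # in-context demonstrations
        prompt = ("Come up with a new task instruction.\n"
                  + "\n".join(f"Task: {t}" for t in demos)
                  + "\nTask:")
        candidate = generate(prompt).strip()
        # Keep only sufficiently novel instructions to preserve diversity.
        if candidate and all(similarity(candidate, t) < sim_threshold for t in pool):
            pool.append(candidate)
    return pool
```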


Takeaways

Examining LLMs from a ranking perspective is both fascinating and insightful. After the pre-training phase, most performance enhancement efforts focus on reranking by adjusting the proposal distribution to better align with task-specific requirements.

Therefore, in the long term, achieving optimal task-specific performance requires focusing on both training-time and inference-time optimizations. Additionally, iterative refinement by improving each aspect in relation to the other can further enhance overall performance.

Fig. 7 Proposal Distribution Transformation

Fig. 7 gives an overview of how the proposal distribution is modified at different phases:

  1. Hypernet assists in the initial transformation of the proposal distribution, helping to scope or accurately identify the distribution that best aligns with task-specific requirements.

  2. During inference, different verifiers with customized generation pipelines can be selected for different tasks to refine task-specific ranking and improve performance.

  3. Iteratively, training data can be enhanced with examples generated during inference, and the verifier can be further optimized with more advanced language models.


Disclaimers

This post is intended to provide insight into how LLMs can be effectively adapted to specific tasks through the lens of traditional ranking. A more in-depth study is needed for a comprehensive analysis of each relevant component.


References
  1. Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155

  2. Direct Preference Optimization: Your Language Model is Secretly a Reward Model, https://arxiv.org/abs/2305.18290

  3. A Survey of Reinforcement Learning from Human Feedback, https://arxiv.org/abs/2312.14925

  4. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, https://arxiv.org/abs/2201.11903

  5. Tree of Thoughts: Deliberate Problem Solving with Large Language Models, https://arxiv.org/abs/2305.10601

  6. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, https://arxiv.org/abs/2408.03314

  7. Recursive Introspection: Teaching Language Model Agents How to Self-Improve, https://arxiv.org/abs/2407.18219

  8. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, https://arxiv.org/abs/2312.06585

  9. STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning, https://arxiv.org/abs/2203.14465

  10. Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning, https://arxiv.org/abs/2205.05638 

  11. Proximal Policy Optimization Algorithms, https://arxiv.org/abs/1707.06347

  12. KTO: Model Alignment as Prospect Theoretic Optimization, https://arxiv.org/abs/2402.01306

  13. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, https://arxiv.org/abs/2407.21787

  14. Large Language Models are Zero-Shot Reasoners, https://arxiv.org/abs/2205.11916

  15. Self-Instruct: Aligning Language Models with Self-Generated Instructions, https://arxiv.org/abs/2212.10560 

  16. AgentInstruct: Toward Generative Teaching with Agentic Flows, https://arxiv.org/abs/2407.03502 

  17. Magicoder: Empowering Code Generation with OSS-Instruct, https://arxiv.org/abs/2312.02120

  18. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/pdf/2501.12948v1

  19. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, https://arxiv.org/abs/2402.03300

  20. DeepSeek-V3 Technical Report, https://arxiv.org/pdf/2412.19437


All rights reserved. Nace.AI © 2026
