Detecting AI Text
A deep-dive into PyTorch logits, including AI Text Detection with Perplexity Scoring
Introduction
For those of us who work with PyTorch regularly, logits are almost always an afterthought: an object we extract from outputs before swiftly moving on to downstream tasks in the pipeline.
But in reality logits are so much more than an intermediary means to an end. They are the model's "gut feelings"! They capture how strongly the model leans towards one choice or another, and because they are raw, unnormalized scores, they preserve the nuance of the model's forward pass that is lost once everything is collapsed into a single prediction.
This fascinating relationship between confidence and prediction mirrors the human experience of intuition before decisions, and it lends itself to some interesting applications. One of these applications is Perplexity-Based AI Detection.
PyTorch
Before going into Perplexity Scoring, let's review some fundamental PyTorch concepts, as they pertain to LLM functionality.
Model & Tokenizer
A crucial first step in any LLM pipeline is to initialize a model and tokenizer. Here I use the AutoModelForCausalLM and AutoTokenizer classes from HuggingFace to do so.
Note that I am using the Qwen3-0.6B model here for its relatively small size.
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
MODEL_NAME = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map = "cuda",
    torch_dtype = torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
Encodings, Inputs, and Outputs
Now that we have our tokenizer, we can create encodings and input IDs from some sample input text:
text = "It's not rocket"
encodings = tokenizer(text, return_tensors = "pt")
input_ids = encodings.input_ids.to("cuda")
print(input_ids)
tensor([[ 2132, 594, 537, 24306]], device='cuda:0')
We can use tokenizer.decode() to see just how the model tokenized and encoded the input text into the 4 input IDs:
for i, ii in enumerate(input_ids[0]):
    dec = tokenizer.decode([ii.item()])
    print(f"Token {i+1}: {dec}")
Token 1: It
Token 2: 's
Token 3: not
Token 4: rocket
Finally, outputs are created by passing the inputs to the model.
Note that I use torch.no_grad() here to disable gradient calculation, which will significantly reduce memory usage. Gradients are crucial during training but since this is just inference they are not needed.
with torch.no_grad():
    outputs = model(input_ids)
Logits
Logits are the focus of this article and a fundamental concept to understand. At their core, they are the raw prediction scores for every token in the model's vocabulary, at each position of the input text, before any activation function (sigmoid, softmax, etc) is applied.
logits = outputs.logits
print(logits.size())
torch.Size([1, 4, 151936])
In this case, the [1, 4, 151936] above refers to:
▪️ 1 : Batch Size (for our one text batch - "It's not rocket")
▪️ 4 : Encoded Sequence Length (i.e. input_ids.shape[1], the number of tokens)
▪️ 151936 : Vocab Size (the model's inherent vocabulary size)
A very important concept to understand is that each position in the logits holds the model's scores for the token that comes next, given everything up to and including that position. Each position therefore returns 151,936 scores (one per vocabulary entry):
▪️ logits[0, 0, :] : 151,936 scores for token that comes after "It"
▪️ logits[0, 1, :] : 151,936 scores for token that comes after "It's"
▪️ logits[0, 2, :] : 151,936 scores for token that comes after "It's not"
▪️ logits[0, 3, :] : 151,936 scores for token that comes after "It's not rocket"
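To make this position-to-target pairing concrete, here is a minimal sketch in plain Python (no model needed). The token strings are the four decoded tokens shown earlier; the helper name next_token_pairs is hypothetical and exists only for illustration.

```python
# Illustration only: pair each logits position with the token it is scoring.
# `next_token_pairs` is a hypothetical helper, not part of PyTorch or transformers.

def next_token_pairs(tokens):
    """Return (position, context, target) triples for a token sequence.

    The target is None for the last position, because the token that
    follows the full text is not part of the input.
    """
    pairs = []
    for i in range(len(tokens)):
        context = "".join(tokens[: i + 1])
        target = tokens[i + 1] if i + 1 < len(tokens) else None
        pairs.append((i, context, target))
    return pairs


tokens = ["It", "'s", " not", " rocket"]  # decoded tokens from the example above
for position, context, target in next_token_pairs(tokens):
    print(f"logits[0, {position}, :] scores what follows {context!r}; actual next token: {target!r}")
```

The first three positions can be checked against the text itself, while the last position scores a token that the input does not contain.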
Another concept to understand is that logits are not model weights. They are the outputs of the model's final layer, not the weights used to calculate them.
While the weights are the learned parameters that transform hidden states into logits, the logits are the result of applying those weights to the specific input.
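The distinction can be sketched with a toy output layer in plain Python. The 3x2 weight matrix and the hidden-state vectors below are made-up numbers, and project_to_logits is a hypothetical stand-in for a model's final linear layer: the weights stay fixed while the logits change with each input.

```python
# Toy illustration (made-up numbers): weights are fixed parameters,
# logits are what you get when those weights are applied to one input.

def project_to_logits(weights, hidden_state):
    """Matrix-vector product: one logit per vocabulary entry (row of weights)."""
    return [sum(w * h for w, h in zip(row, hidden_state)) for row in weights]


# Hypothetical "final layer" for a 3-token vocabulary and 2-dim hidden states.
weights = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

hidden_a = [2.0, -1.0]  # hidden state for one input
hidden_b = [0.5, 3.0]   # hidden state for a different input

print(project_to_logits(weights, hidden_a))  # logits for input A
print(project_to_logits(weights, hidden_b))  # logits for input B
```

The same weights produce two different sets of logits, one per input.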
Logits to Probabilities
A highly useful property of logits is that they can be transformed into probabilities. This is done using the torch.softmax() function.
As an example, let's take a look at the logits for the token that the model predicts will come after the "It's not rocket" text, i.e. the logits at the last position:
last_logits = logits[0, -1, :] # equivalent to logits[0, 3, :]
probs = torch.softmax(last_logits, dim = -1)
print(probs.size())
torch.Size([151936])
The output is 151,936 probabilities, one for each of the 151,936 entries in the model's vocabulary.
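For intuition, the softmax transformation can be written out by hand. This is a minimal plain-Python sketch on four made-up logits, not PyTorch's implementation; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the max avoids overflow in exp() without changing the output.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [4.0, 1.0, 0.5, -2.0]  # made-up scores for a 4-token vocabulary
toy_probs = softmax(toy_logits)

print([round(p, 4) for p in toy_probs])
print(sum(toy_probs))  # very close to 1.0
```

Note that only the gaps between logits matter, not their absolute values: adding the same constant to every logit leaves the probabilities unchanged.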
If we sort these probabilities by highest-to-lowest using torch.argsort() we can see the top token predictions. Here we look at the top 5 most probable tokens:
sorted_probs_indices = torch.argsort(probs, descending = True)
for idx in sorted_probs_indices[:5]:
    tkn = tokenizer.decode([idx])
    prb = np.round(probs[idx].item(), 4)
    print(f"Token: {tkn} | prob = {prb}")
Token: science | prob = 0.9688
Token: surgery | prob = 0.0027
Token: fuel | prob = 0.0026
Token: ing | prob = 0.0024
Token: scientists | prob = 0.0015
It makes sense that the most probable next-token for the phrase "It's not rocket" turns out to be "science" 🤖
With this in mind, we can write a small function that completes a phrase and also prints the probability of the token it predicted for the completion:
def get_pred_with_prob(txt):
    encodings = tokenizer(txt, return_tensors = "pt")
    input_ids = encodings.input_ids.to("cuda")
    with torch.no_grad():
        outputs = model(input_ids)
    logits = outputs.logits
    last_logits = logits[0, -1, :]
    probs = torch.softmax(last_logits, dim = -1)
    top_probs_index = torch.argsort(probs, descending = True)[0]
    token = tokenizer.decode([top_probs_index])
    print(f"'{txt + token}'\nProbability: {np.round(probs[top_probs_index].item(), 4)}")
get_pred_with_prob("It's not rocket")
'It's not rocket science'
Probability: 0.9688
get_pred_with_prob("A rose by any other")
'A rose by any other name'
Probability: 0.9961
get_pred_with_prob("Life is like a box of")
'Life is like a box of chocolates'
Probability: 0.8555
Logits to Predictions
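Of course, probabilities are not the only way to extract predictions from logits. Because softmax preserves the ordering of the scores, the token with the highest logit is also the most probable one, so we can work with the raw logits directly.
One way is to use the torch.argmax() function, which returns the index of the highest score. We can then use that index to look up the corresponding token: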
last_logits = logits[0, -1, :]
highest_score_index = torch.argmax(last_logits).item()
print(tokenizer.decode([highest_score_index]))
' science'
Another method to extract the top scores (and their tokens) is by using torch.topk().
Here we pull out the top 5 predictions:
top_5_scores, top_5_indices = torch.topk(last_logits, k = 5)
for idx, score in zip(top_5_indices, top_5_scores):
    tkn = tokenizer.decode([idx])
    scr = np.round(score.item(), 4)
    print(f"Token: {tkn} | score = {scr}")
Token: science | score = 20.5
Token: surgery | score = 14.6875
Token: fuel | score = 14.625
Token: ing | score = 14.5
Token: scientists | score = 14.125
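These raw scores and the probabilities from the previous section are two views of the same prediction: under softmax, the ratio of two tokens' probabilities equals the exponential of the difference between their logits. As a rough check, this sketch plugs in the rounded values printed above for " science" and " surgery". Because those values are rounded and were computed in bfloat16, only approximate agreement should be expected.

```python
import math

# Values copied from the printed outputs above (rounded, bfloat16).
logit_science, logit_surgery = 20.5, 14.6875
prob_science, prob_surgery = 0.9688, 0.0027

# Under softmax: p_a / p_b = exp(logit_a - logit_b)
ratio_from_logits = math.exp(logit_science - logit_surgery)
ratio_from_probs = prob_science / prob_surgery

print(f"Ratio implied by logits:        {ratio_from_logits:.1f}")
print(f"Ratio of printed probabilities: {ratio_from_probs:.1f}")
```

The two ratios should land in the same ballpark, with the remaining gap consistent with rounding in the printed values.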
With our logits review out of the way, let's finally move on to detecting AI text using Perplexity Scoring.
Perplexity
The term Perplexity in this context is a sort of "surprise score". More specifically, it measures how surprised an LLM is to encounter the tokens in a piece of text. The perplexity-based detection method builds on an LLM's core mode of operation: predicting the next token from the scores it assigns to every candidate, given the preceding token(s).
The perplexity calculation is based on PyTorch's built-in Cross-Entropy Loss, which is returned in the outputs.loss attribute. For a language model, this Cross-Entropy Loss is the Negative Log-Likelihood (aka Negative Log-Probability) of the correct token, averaged over all predicted positions in the text.
It follows that the Perplexity Score is just the exponential of this average negative log-likelihood. In other words:
▪️ perplexity = exp(loss) = exp(-mean(log_prob)); for a single token this is simply 1/probability
The convention of using a reciprocal (1/probability) for the Perplexity Score stems from the inverse relationship between probability and uncertainty ("confusion"). As probability of a correct token increases, the uncertainty of the model decreases.
In this way an intuitive scale is created whereby higher Perplexity Scores mean more uncertainty:
▪️ Higher Perplexity = "Higher model uncertainty" = text is more likely Human
▪️ Lower Perplexity = "Lower model uncertainty" = text is more likely AI
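A small worked example may help here. The sketch below uses plain Python and made-up probabilities for the correct tokens of two hypothetical four-token texts. It shows that exponentiating the average negative log-probability gives the same number as the geometric mean of 1/probability.

```python
import math

def perplexity_from_probs(correct_token_probs):
    """exp of the mean negative log-probability of the correct tokens."""
    neg_log_probs = [-math.log(p) for p in correct_token_probs]
    mean_nll = sum(neg_log_probs) / len(neg_log_probs)
    return math.exp(mean_nll)


# Made-up probabilities the model assigned to each correct next token.
predictable = [0.9, 0.8, 0.95, 0.85]   # model is rarely surprised
surprising = [0.2, 0.05, 0.4, 0.1]     # model is often surprised

print(round(perplexity_from_probs(predictable), 2))
print(round(perplexity_from_probs(surprising), 2))

# Same quantity, written as the geometric mean of 1/probability.
geo_mean = math.prod(1 / p for p in surprising) ** (1 / len(surprising))
print(round(geo_mean, 2))
```

The text whose tokens the model found predictable gets a perplexity close to 1, while the surprising text scores several times higher.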
Calculating Perplexity
The first step in calculating a perplexity score is to create the outputs object. But unlike in previous sections, we also have to populate the labels arg so that the model returns the outputs.loss value.
This makes sense as the labels arg provides the targets against which the model will compare its own predictions and thus calculate a loss. No targets, no comparisons, no loss calculation.
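In our case, the targets are simply a clone of the inputs, because our entire methodology relies on comparing existing text to what a model "thinks" it should be (i.e. what it predicts) in order to calculate how "confused" it is. There is no need to offset the targets ourselves: the model shifts the labels by one position internally, so the prediction at each position is scored against the token that actually comes next.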
target_ids = input_ids.clone()
with torch.no_grad():
    outputs = model(input_ids, labels = target_ids)
Next, we extract the Negative Log-Probability which is (again) just the value of outputs.loss:
neg_log_prob = outputs.loss
Finally, we calculate a Perplexity Score by taking the exponent of the negative log-probability:
perplexity = torch.exp(neg_log_prob).item()
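Under the hood, the loss only compares each position's prediction with the token that actually follows it. The plain-Python sketch below mimics that shift with made-up per-position probability tables over a hypothetical three-token vocabulary; shifted_mean_nll is an illustrative helper, not a transformers function.

```python
import math

def shifted_mean_nll(token_ids, probs_per_position):
    """Mean negative log-probability of each actual next token.

    probs_per_position[i] is the (made-up) distribution predicted after
    reading tokens 0..i, so it is scored against token_ids[i + 1].
    The last position has no target and is skipped.
    """
    nlls = []
    for i in range(len(token_ids) - 1):
        target = token_ids[i + 1]
        nlls.append(-math.log(probs_per_position[i][target]))
    return sum(nlls) / len(nlls)


token_ids = [0, 2, 1]  # a toy 3-token "text" over the vocabulary {0, 1, 2}
probs_per_position = [
    [0.1, 0.2, 0.7],   # after token 0: actual next token is 2 (p = 0.7)
    [0.3, 0.5, 0.2],   # after token 2: actual next token is 1 (p = 0.5)
    [0.6, 0.3, 0.1],   # after token 1: nothing follows, so unused
]

loss = shifted_mean_nll(token_ids, probs_per_position)
print(f"loss = {loss:.4f}, perplexity = {math.exp(loss):.4f}")
```

A three-token text therefore contributes only two terms to the loss, which is why very short inputs give noisy perplexity scores.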
We can combine these steps into a function:
def calc_perplexity(text, model, tokenizer):
    """
    Calculates the Perplexity Score for a given text.

    Args:
        text: Input text string
        model: The 'AutoModelForCausalLM' model
        tokenizer: The 'AutoTokenizer' tokenizer

    Returns:
        float: Perplexity score
    """
    ## Use the device the model already lives on:
    device = model.device
    ## Inputs & Targets:
    input_ids = tokenizer(text, return_tensors = "pt").input_ids.to(device)
    target_ids = input_ids.clone()
    ## Calc mean negative log-likelihood:
    with torch.no_grad():
        outputs = model(input_ids, labels = target_ids)
    neg_log_prob = outputs.loss
    ## Calc & return perplexity score:
    perp = torch.exp(neg_log_prob).item()
    return np.round(perp, 2)
Let's test it on some text examples:
human_text = """
I can't believe it's already Thursday! Time flies when you're busy, I guess.
Had the weirdest dream last night about flying penguins. Why penguins? No idea.
Anyway, need to grab coffee before my meeting. This week has been absolutely insane.
"""
ai_text = """
Artificial intelligence has become increasingly important in modern society.
It offers numerous benefits across various industries and applications.
Machine learning algorithms enable computers to learn from data and improve over time.
This technology continues to advance and shape our future in meaningful ways.
"""
human_perp = calc_perplexity(human_text, model, tokenizer)
print(human_perp)
43.34
ai_perp = calc_perplexity(ai_text, model, tokenizer)
print(ai_perp)
8.57
For these two examples, the results line up with the scale described earlier:
▪️ Higher Perplexity (43.34) = "Higher model uncertainty" = text is more likely Human
▪️ Lower Perplexity (8.57) = "Lower model uncertainty" = text is more likely AI
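To turn a score into a decision, one simple option is a cutoff. The sketch below is purely illustrative: label_by_perplexity is a hypothetical helper, and the threshold of 20.0 is an arbitrary assumption chosen to sit between the two scores above, not a calibrated value. In practice a threshold would need to be tuned on labeled examples for the specific scoring model.

```python
def label_by_perplexity(perplexity, threshold=20.0):
    """Label text from its perplexity using a single cutoff.

    `threshold` is a made-up default for illustration, not a calibrated value.
    """
    if perplexity >= threshold:
        return "more likely Human"
    return "more likely AI"


# Scores taken from the two examples above.
for name, score in [("human_text", 43.34), ("ai_text", 8.57)]:
    print(f"{name}: perplexity = {score} -> {label_by_perplexity(score)}")
```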
Denouement
Logits are the often-unsung heroes of PyTorch pipeline development. We use them without giving much thought to what they actually represent, and we overlook how closely they resemble the intuition that precedes a human decision.
And while this is probably an overly anthropomorphic description, it nevertheless approximates the true value of the logits' role in the complexity of machine learning.
As always, thank you very much for reading!