
You’re scrolling through an article, a product review, a comment on a forum. It’s well-written, clear, and informative. But something feels… off. It’s a little too polished, a little too structured. It’s helpful, but in a way that feels strangely impersonal. You have a fleeting thought: Was this written by a person?
This vague, intuitive feeling got me curious. Could I quantify it? Could I use data to prove that AI models have a distinct “voice”? To find out, I built a Colab notebook to systematically compare thousands of human and AI-generated texts. My goal was to find the statistical fingerprints that give the AI away.
You can run the analysis yourself and explore the code here.
So, what did my analysis flag? The results were surprisingly consistent. The AI has a distinct vernacular: not of errors, but of hyper-competence.
Here are some of the top “tells” my analysis flagged for the gemini-2.5-flash-preview-05-20 model:
- Signposting: At the very top of the list are words and phrases like here’s, break down, and let’s. The AI loves to begin its response by laying out a map for the user, as in “Here’s a breakdown of the key factors…” or “Let’s break down the process.” This is the language of a hyper-organized instructor.
- Authoritative qualifiers: Words like crucial, comprehensive, key, and primarily appear far more often in the AI’s vocabulary. It doesn’t just provide information; it frames that information as definitive and important. It offers not just a guide, but a “comprehensive guide.”
- Enthusiastic affirmations: The AI often begins its response by praising the user’s query. Bigrams like great question and that’s fantastic appear with significant frequency in the AI’s responses, but are almost entirely absent from the human corpus. This is the language of a hyper-agreeable assistant, designed to build rapport and validate the user’s input before delivering an answer. While a person might occasionally use these phrases, the AI uses them with a programmatic consistency that gives it away.
This is the voice of a “T.A. for the world” — endlessly patient, perfectly structured, and slightly formal. Its primary goal is to explain, and its word choice reflects that objective with unwavering focus.
This instructor-like voice becomes even more apparent when you see the top tells visualized as a word cloud:
Of course, the results are only as good as the data and the controls. I took care to mitigate the most obvious issues.
First, to ensure the AI wasn’t just being wordier, I configured its output to match the length of the human responses. This control is more critical than it might seem. A longer response isn’t just a short one with more words; it demands a different kind of language altogether. To make a longer text coherent, a writer must introduce more structural and transitional language. They need to add clauses to connect ideas (“Furthermore,” “In addition,” “However…”) and phrases to frame points (“It’s important to consider…”). The graph below confirms the average response lengths were nearly identical.
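For reference, here’s roughly what capping response length looks like with the Gemini API. This is a minimal sketch using the google-genai SDK with a hypothetical target_tokens value, not necessarily how the notebook does it:

```python
# Sketch: cap a Gemini response at roughly the length of the matching human answer.
# Assumes an API key is configured, e.g. via the GEMINI_API_KEY environment variable.
from google import genai
from google.genai import types

client = genai.Client()

target_tokens = 180  # hypothetical value, e.g. estimated from the human response's length

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-05-20",
    contents="Explain how a French press works.",
    config=types.GenerateContentConfig(max_output_tokens=target_tokens),
)
print(response.text)
```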
Second, the dataset itself is a huge factor. I used the databricks-dolly-15k dataset, available on Hugging Face. It’s an excellent open-source dataset of 15,000 instruction-following prompts and responses generated by thousands of Databricks employees. However, because it’s focused on instruction-following, it likely biased the AI towards adopting its “helpful instructor” persona. The tells for creative writing would surely be different.
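If you want to poke at the same data, it’s a single call with the Hugging Face datasets library. A minimal sketch of just the loading step:

```python
# Load the databricks-dolly-15k dataset from Hugging Face.
# Requires: pip install datasets
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly)                    # ~15k rows: instruction, context, response, category
print(dolly[0]["instruction"])  # peek at one human-written prompt
print(dolly[0]["response"])     # ...and its human-written response
```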
Finally, this is a snapshot in time of a single model. The “tells” of today’s Gemini are not permanent. As developers identify these patterns, they will train them out of future models to make them sound more natural. We are in a constant cat-and-mouse game where the tells will get subtler, and our intuition will have to get sharper.
Now, for those interested in the mechanics, here’s how I identified the “tells.” I used the log-likelihood ratio test (also known as a G-test), which checks if a word’s appearance in one body of text is significantly more frequent than in another.
The first step is to frame the problem in a 2×2 contingency table for each word being tested:
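For a candidate word, the table looks something like this:

| | AI text | Human text |
|---|---|---|
| Count of the word | k₁ | k₂ |
| Count of all other words | n₁ - k₁ | n₂ - k₂ |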
k₁ represents how many times the word appears in the AI text, and n₁ represents the total number of words in the AI text (and likewise for k₂ and n₂ in the human text).
These four values in the table are our Observed (O) frequencies. They are the facts, the ground truth of our data. But the main question is: are these observed numbers surprising?
To answer that, we need a baseline to compare against. The Expected (E) frequencies are what we’d see if the word were used at the exact same rate across both human and AI writing — a world where there is no “tell.” We calculate this by finding the word’s overall frequency across all texts combined and then applying that rate to the size of the AI and human corpora.
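To make that concrete with made-up numbers: suppose a word appears 250 times in 100,000 words of AI text and 40 times in 100,000 words of human text. Its pooled rate is 290 / 200,000 ≈ 0.145%, so under the “no-tell” assumption we’d expect roughly 145 occurrences in each corpus. Those expected counts are what the observed 250 and 40 get compared against.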
The log-likelihood test then calculates a “surprise” score by comparing the Observed reality to the Expected baseline. The formula looks like this:
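G = 2 · Σ Oᵢ · ln(Oᵢ / Eᵢ)

Here the sum runs over the four cells of the contingency table, Oᵢ is the observed count in a cell, and Eᵢ is the corresponding expected count. (This is the standard form of the G-statistic.)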
A high G-score means our observed numbers are very different from the expected “no-tell” scenario, indicating the word is a strong statistical signal. A score near zero means the word is used about equally by both humans and the AI.
In the notebook, this entire process is handled directly by the scipy.stats.chi2_contingency function.
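Here’s a stripped-down sketch of that call, using the same made-up counts as the worked example above. Passing lambda_="log-likelihood" switches SciPy from Pearson’s chi-square to the G-test, and I’m assuming correction=False here so the score matches the raw formula:

```python
# G-test for a single candidate word, with placeholder counts.
from scipy.stats import chi2_contingency

k1, n1 = 250, 100_000   # word count and total words in the AI corpus (made up)
k2, n2 = 40, 100_000    # word count and total words in the human corpus (made up)

# 2x2 contingency table: rows = [this word, all other words], columns = [AI, human]
table = [[k1, k2],
         [n1 - k1, n2 - k2]]

# lambda_="log-likelihood" computes the G-statistic instead of Pearson's chi-square;
# correction=False skips Yates' continuity correction for 2x2 tables.
g_score, p_value, dof, expected = chi2_contingency(
    table, correction=False, lambda_="log-likelihood"
)

print(f"G = {g_score:.1f}, p = {p_value:.2e}")
print("Expected counts under the no-tell baseline:")
print(expected)  # first row: ~145 expected occurrences of the word in each corpus
```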
Let’s bring this back to an important question: what does it mean to sound human? If the AI’s signature is its flawless structure, then perhaps the human signature is our beautiful, authentic messiness.
We hedge. We use filler words. We have pet phrases and weird analogies. We circle a topic before landing on a point. Our writing carries the faint, unique accent of our own mind — our memories, our mood, our very biology.
We are entering a “stylistic uncanny valley.” The AI is so close to sounding human that its small, systematic deviations become jarring. Its perfect posture and relentless helpfulness feel alien. It’s like talking to a person who has never been tired, bored, or hilariously wrong in their entire life.
The next time you read something that feels just a little too perfect, pause. You might not just be spotting an AI. You might be feeling the absence of the beautiful, chaotic, and unmistakable fingerprint of a human mind.
This analysis was powered by the same tools that are available to developers everywhere. If this exploration sparked your curiosity, I encourage you to go further:
- Experiment with Gemini: The analysis in my notebook uses the Gemini family of models. You can get started with these powerful, multimodal models through Google AI Studio for quick prototyping, or integrate them into enterprise-grade applications using Vertex AI on Google Cloud.
- Build your own analysis: Fork my notebook and try a different dataset. See how the “tells” change when the AI is asked to write poetry, code, or marketing copy. Experiment with different features beyond unigram and bigram frequency. Fine-tune a model on your own writing and see how it compares. The possibilities for discovery are endless.
The line between human and machine-generated text will only continue to blur. The more we understand how these systems work, the better equipped we’ll be to use them thoughtfully and effectively. Feel free to connect on LinkedIn, X, and Bluesky to continue the discussion!