Security Scores Are Meaningless

I build security scoring systems. I know better than anyone that they don't measure actual security. Here's why I keep building them anyway.

The embarrassing part of building security scores is that if you do it long enough, you stop believing in them.

I don’t mean you stop believing in measurement. Counting things is fine. Looking at a domain and noticing expired TLS certificates, missing DMARC, ancient SSH banners, open management ports — those are real signals. They are observable. They matter.

What I stopped believing in is the last step. The theatrical part where all those signals get crushed into a single number and somebody asks, with a straight face, “So how secure is it?”

That number is fake. Not random. Not useless in every possible way. But fake in the specific sense that it pretends to answer a question it cannot answer.

Hygiene Is Not Security

A domain can score 95 out of 100. HSTS enabled, DNSSEC signed, DMARC at p=reject, all the headers in place. Beautiful. And then it gets breached through a zero-day in a VPN appliance that no scoring system on earth would have caught. Meanwhile, a personal blog running on shared hosting scores a 40 and sits there untouched for a decade. Nobody attacks it because nobody cares about it.

The score measures hygiene. Not security.

Think of it like a health checkup. Your doctor checks your blood pressure, cholesterol, BMI. All normal. Great numbers. You walk out of the office and get hit by a bus. The checkup measured your baseline health — it said nothing about whether you’d survive the day.

The things that actually determine whether you get breached — unpatched internal services, weak credentials, a contractor’s laptop with RDP open, a phishing email that one person clicks — none of those show up in a DNS lookup or an HTTP header scan. They can’t. They’re invisible from the outside. The public internet gives you a peephole. Scores are what happen when someone mistakes the peephole for the building.
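To make the peephole concrete, here is a minimal sketch, assuming Python and a placeholder target domain, of roughly what an external scan can observe: a handful of response headers and nothing behind them. It is an illustration of the idea, not any particular scanner.

```python
# Sketch of the "peephole": the only things an outside scan can see are
# whatever the domain chooses to show. Target domain is a placeholder.
import urllib.request

def visible_hygiene(domain: str) -> dict[str, bool]:
    # Fetch the front page and look at a few security headers.
    # Redirect handling, DNS checks, etc. are deliberately glossed over.
    resp = urllib.request.urlopen(f"https://{domain}", timeout=10)
    headers = resp.headers
    return {
        "hsts": headers.get("Strict-Transport-Security") is not None,
        "csp": headers.get("Content-Security-Policy") is not None,
        "x_frame_options": headers.get("X-Frame-Options") is not None,
    }

print(visible_hygiene("example.com"))
# Everything this returns is real and observable. None of it says whether
# an unpatched box or an open RDP port sits behind the same domain.
```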

I Can Move Half the Internet With One Coefficient

Here’s the part that keeps me up at night.

Every scoring system assigns weights. HTTPS enforcement might be worth 15 points. DNSSEC worth 10. DMARC worth 12. How much should an expired certificate cost? How much should p=none in DMARC cost? Is missing DNSSEC a serious deduction or a shrug? There is no RFC that answers this. There is no mathematical theorem that says missing MTA-STS is worth 4 points but a weak SPF policy is worth 7.

You fiddle with the weights until the outputs feel plausible. That’s the dirty secret. You run the model over a large set of domains, stare at the distribution, and decide whether the curve looks embarrassing enough to be believable. If too many domains get A grades, you make the penalties harsher. If the internet suddenly looks like a radioactive junkyard, you loosen one knob.

Change one coefficient and you shift half the internet from a B to a C.

I’ve watched it happen. Tweak the penalty for missing Content-Security-Policy from -5 to -10 and thousands of domains drop a letter grade. Did they get less secure overnight? No. I moved a slider. After you’ve watched a single coefficient change reclassify half your dataset, it becomes very hard to say the resulting score is a property of the domain. It is partly a property of your mood.
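If you want to see the slider move, here is a toy sketch in Python. The signals, penalty weights, and grade cutoffs are invented for illustration and are not anyone's production model; the only point is that the domain's configuration never changes while the grade does.

```python
# Toy weighted hygiene score. All weights and cutoffs below are made up.

GRADE_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D"), (0, "F")]

def grade(score: int) -> str:
    return next(letter for cutoff, letter in GRADE_CUTOFFS if score >= cutoff)

def score_domain(signals: dict[str, bool], penalties: dict[str, int]) -> int:
    """Start from 100 and subtract a penalty for each missing control."""
    total = 100
    for control, present in signals.items():
        if not present:
            total -= penalties.get(control, 0)
    return max(total, 0)

# One hypothetical domain: decent basics, no CSP, no MTA-STS.
example = {
    "https_enforced": True,
    "dmarc_reject": True,
    "dnssec": True,
    "csp": False,
    "mta_sts": False,
}

lenient = {"https_enforced": 15, "dmarc_reject": 12, "dnssec": 10, "csp": 5, "mta_sts": 4}
harsh = dict(lenient, csp=10)  # the only change: the CSP penalty goes from 5 to 10

for label, penalties in (("csp penalty -5", lenient), ("csp penalty -10", harsh)):
    s = score_domain(example, penalties)
    print(f"{label}: score={s}, grade={grade(s)}")
# The domain drops from an A to a B without anything about it changing.
```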

The question “how secure is this domain?” sounds objective. It isn’t. We took a fundamentally subjective exercise, wrapped it in decimal points, and encouraged people to treat disagreement as imprecision instead of philosophy.

People Still Want the Number

And yet people absolutely love the number.

Of course they do.

Show someone a 47-page security audit and their eyes glaze over. Show them “B+” and they immediately understand where they stand. Or they think they do. The letter grade turns a squishy judgment into a procurement checkbox. It lets an executive ask for improvement without learning SMTP. It lets a board slide contain an arrow that points upward. It lets a security team say “we improved external posture by 18%” and everybody nods because the alternative is a 30-minute argument about what “posture” even means.

There’s a name for this: the McNamara fallacy. If you can measure it, it must be important. If you can’t measure it, it must not be. During Vietnam, Robert McNamara measured success by body count because that was the number he could get. The things that actually determined the war’s outcome — morale, political will, local support — weren’t on his spreadsheet.

Security scores are body counts. They’re the metric you can get, not the metric that matters.

The irony is that security scoring survives not because it is accurate, but because it is emotionally efficient. Quantification creates the illusion of control. A B-minus may be bad news, but at least it is legible bad news.

The Perverse Incentive

Once you have a score, people optimize for the score instead of for actual security.

I’ve watched organizations spend weeks adding obscure HTTP headers to bump their grade from B to A, while ignoring a three-year-old Apache Struts installation on an internal subnet. The headers are visible. The scanner checks them. The grade goes up. The actual risk doesn’t change at all. Meanwhile they’re not doing tabletop exercises, not auditing identity access, not hunting for logic flaws in their application architecture.

This is Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. The moment someone decides “we need an A rating,” the score stops reflecting security posture and starts reflecting score-optimization effort.

We have built an entire industry around measuring the outside of the fortress while the guards inside leave the back door propped open for fresh air.

What the Score Is Actually For

The honest answer is narrower and less glamorous than anyone wants.

A security score is decent at one job: turning a pile of externally visible hygiene issues into a queue. It tells you where to look first. If you’re assessing fifty vendors, dropping the bottom twenty based on terrible scores is a reasonable heuristic. If your DMARC enforcement disappeared after a migration, the score will catch the regression. As a change detector, as a rough sorting mechanism for operational sloppiness, it can earn its keep.
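As a sketch of that narrower job, assuming made-up domains, scores, and an arbitrary alert threshold: the score as a sort key for a review queue and as a regression alarm, nothing more.

```python
# The two jobs a score is decent at: sorting a vendor list into a review
# queue, and flagging a regression between two scans. Domains, scores,
# and the threshold are made up for illustration.

vendors = {"acme.example": 38, "widgets.example": 91, "payroll.example": 64}

# Job 1: a triage queue, worst first. Nothing more than a sort key.
review_queue = sorted(vendors, key=vendors.get)
print("review first:", review_queue)

# Job 2: a change detector. A sudden drop after a migration is worth a
# look; the absolute number still is not a probability of compromise.
ALERT_DROP = 10  # arbitrary threshold

def check_regression(domain: str, previous: int, current: int) -> None:
    drop = previous - current
    if drop >= ALERT_DROP:
        print(f"{domain} dropped {drop} points; something externally visible changed")

check_regression("payroll.example", previous=82, current=64)
```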

The moment you use it as a proxy for “likelihood of compromise,” you’re back to fortune-telling.

If I could add a disclaimer to every score I generate, it would say: This number tells you whether someone configured the obvious things. It says nothing about the non-obvious things, which are where breaches actually happen. But nobody wants to read that. They want the letter grade. They want the green checkmark.

I build the green checkmarks. I just don’t believe in them as much as you do.
