Unstructured Data 3: Skeletons in the Closet

The hidden risks lurking in our unstructured data - and why GenAI will drag them into the light.
We've seen the elephant in the room: unstructured data, the 80–90% of information that never makes it into dashboards.
We've felt the rising tide: the relentless growth driven by collaboration, culture, and the frictionless creation of content.
Now it's time to open the closet.
Because hidden inside that tide are skeletons - forgotten files, sensitive data, risky documents - that most organisations don't even know they're storing. And with GenAI about to illuminate every corner of our information landscape, the days of "out of sight, out of mind" are over.
Security by Obscurity Is Dead - and So Is the Illusion of "Safe" Data
For decades, we've relied on a quiet, often unspoken assumption: if sensitive data was buried deeply enough - in archived email threads, old SharePoint sites, or long-forgotten network drives - it was effectively safe. Nobody could find it, so it wasn't a risk.
That assumption no longer holds.
GenAI doesn't distinguish between current and obsolete, relevant and outdated, approved and deprecated. It reads everything — and in doing so, it surfaces everything. That doesn't just include sensitive information. It includes old decisions, superseded designs, obsolete policies, outdated standards, and past strategies that no longer reflect how the organisation operates today.
When those artefacts are surfaced without context, they don't simply confuse - they can contradict current policy, muddy governance decisions, and pollute AI-generated outputs with incorrect or conflicting information. The result isn't just exposure; it's distortion.
My own experience, and a recent large-scale analysis of 1,000 real-world environments, point the same way: 90% of organisations have exposed sensitive cloud data, and 99% have sensitive information dangerously exposed to AI tools - a visibility problem as much as a security one.
What's Lurking in the Shadows
Every organisation has skeletons hiding in its unstructured data. These go far beyond PII or compliance data:
- Sensitive personal information: CVs, ID documents, medical details, tax numbers.
- Legacy decisions: strategy and policy papers that no longer apply - even old drafts can still influence thinking.
- Outdated standards and designs: technical documentation, old technical designs, or product plans that conflict with current approaches.
- Regulated data: privacy-covered content embedded in chat logs, emails, or PDFs.
- Shadow datasets: exports from SaaS platforms no one remembers storing.
IBM's Cost of a Data Breach 2024 found customer PII was the most frequently compromised data type (≈46%), underscoring how often "ordinary" content becomes high-risk when it leaks.
And Varonis' 2025 study shows the operational reality many teams face: 90% of organisations have sensitive files exposed to all employees via M365/Copilot - exactly the kind of context-free resurfacing GenAI accelerates. I love this semantic search capability in Copilot, just quietly.
The Cost of What We Don't See
The consequences are getting harder to ignore:
- Regulatory fines: GDPR enforcement remains heavy; across 2024, fines totalled roughly €1.2 billion, per CMS's Enforcement Tracker analysis of monthly penalties.
- Legal exposure: resurfaced, superseded clauses or contracts can re-open dormant liabilities.
- Reputational damage: incidents involving archived or obsolete data make organisations look careless, not unlucky.
- AI mistrust: when a tool like Copilot elevates a deprecated standard or an old decision as "fact", confidence in GenAI plummets.
High-profile cases in late 2024 and 2025 (e.g., large platform fines; collaboration-tool breaches) only add pressure for visible, defensible governance.
All of this slows us down: it breeds hesitancy about embracing the burgeoning Intelligence Era, and it can mean lost opportunity - or direct market impact - as we fall behind better-prepared competitors.
Opening the Closet: A New Kind of Governance
The answer isn't to avoid GenAI - it's to prepare for it. Governance isn't about slowing innovation; it's about making innovation safe.
That means:
- Visibility: Audit and classify data across collaboration platforms, file shares, and archives.
- Control: Implement clear retention, deletion, and access policies - and actually enforce them.
- Readiness: Assume AI will see and use everything. Build guardrails (policies and standards for labelling, DLP, access hygiene) accordingly.
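To make the "Visibility" step concrete, here is a minimal sketch of a first-pass audit: scan document text for sensitive-data patterns and report which categories each file triggers. The pattern names, regexes, and `audit` function are illustrative assumptions for this post - a real programme would use a proper classification engine (sensitivity labelling, DLP tooling) rather than hand-rolled regexes.

```python
import re

# Hypothetical first-pass patterns - real classifiers go far deeper
# and handle validation, context, and false positives properly.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    # Shape of an Australian Tax File Number (illustrative only)
    "au_tfn": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b"),
}

def classify(text: str) -> set[str]:
    """Return the set of sensitive-data categories detected in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}

def audit(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each document path to the categories it triggers; empty set = clean."""
    return {path: classify(body) for path, body in documents.items()}
```

Even a crude pass like this turns "we don't know what's out there" into a ranked list of locations to review - the point is the inventory, not the sophistication of the matching.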
It's telling that improving unstructured-data governance is now a board-level priority for nearly half of organisations, according to a 2024 industry survey.
Hope: Light Is Better Than Darkness
Yes, this is uncomfortable - opening the closet always is - but it's also empowering.
Once we know what's there - once we've named the skeletons - we can decide what stays, what goes, and what must be protected.
That knowledge turns AI from a threat into a tool. It lets us pursue automation and augmentation with confidence. And it transforms governance from a compliance exercise into a strategic enabler.
The age of security by obscurity is over. The era of informed, intentional governance is just beginning.
Call to Action
How confident are you that you know what's hiding in your unstructured data - and would you trust AI to read it all tomorrow?
In the next post, we'll look at why GenAI isn't just a risk, but also a remarkable opportunity - the double-edged sword that could finally help us bring this elephant under control.
Photo by Carlos Felipe Ramírez Mesa on Unsplash
