Does Size (of Data) Matter?

· Dr. Ramy Azzam

Does Size (of Data) Matter?

London, 1854. A cholera outbreak was ravaging the city, killing over 10,000 people, and no one knew why. The prevailing "miasma theory" blamed bad air. That is, until Dr. John Snow, armed with a notebook and an inquisitive mind, mapped out just 616 cholera cases by hand. His discovery? A single water pump on Broad Street was the culprit. That modest, hand-collected dataset, smaller than most modern A/B tests, changed the trajectory of public health forever and laid the foundation for modern epidemiology.

Snow's breakthrough came not from analyzing all London's data, but from noticing that brewery workers near the pump weren't getting sick. Why? They drank beer, not water. This single qualitative observation validated his quantitative hypothesis and saved countless lives.

Now shift to today. The tools have changed, spreadsheets have become dashboards, clinical notes have become data lakes storing 2.5 quintillion bytes of health data daily, and every buzzword from "predictive analytics" to "generative AI" is fighting for attention. But one thing remains consistent: in the pursuit of hype, small data often gets left behind. Sometimes it's an afterthought. Sometimes, it's not even in the conversation.


Shiny Big Data

Consider these numbers: Google processes 3.5 billion searches daily; Facebook generates 4 petabytes of data per day; the average hospital produces 50 petabytes annually. Big data is flashy. It scales. It demos well. And in boardrooms and investor decks, it tends to wow the room.

Big data makes catchy headlines, with national genomics platforms (the UK Biobank contains genetic data from ~500,000 participants), AI-enabled diagnostics, and digital twins of entire patient populations. It feeds the dream of total insight. Yet here's a sobering statistic: despite having access to more health data than ever before, 80% of AI projects in healthcare fail to reach clinical deployment, often because they lack contextual understanding.

Let's be fair. Big data has revolutionized entire industries. In healthcare, Epic's predictive models analyze ~ half a million patient records to detect sepsis 6 hours earlier than traditional methods. During COVID-19, Google Trends data detected outbreaks 1-2 weeks before official reporting in 80% of global regions.

In finance, JPMorgan Chase analyzes 5 billion transactions daily for fraud detection with 99.9% accuracy. In retail, Amazon's recommendation engine drives 35% of sales through analysis of 300+ million customer profiles. Netflix's algorithm, processing viewing data from 230 million subscribers, saves the company $1 billion annually in customer retention. These systemic shifts are often powered by data at scale.

But does scale guarantee significance?

Article contentBig Tech... Big Data


What Scale (Sometimes) Obscures

IBM Watson for Oncology is the cautionary tale of our time. Despite training on 650,000 medical papers and 12,000 oncology cases, it failed to live up to expectations. The issue wasn't algorithmic sophistication, it was data quality. Training data was primarily from Memorial Sloan Kettering's guidelines, not diverse real-world cases. A $62 million investment that multiple hospitals abandoned because it recommended treatments contradicting local clinical expertise.

Here's a telling detail: radiologists at MD Anderson found Watson agreed with their decisions only 96% of the time for lung cancer cases, sounds impressive until you realize the 4% disagreement often involved life-or-death treatment decisions.

Big data amplifies three critical risks: bias amplification (Amazon's hiring algorithm discriminated against women because historical data reflected male-dominated hiring), consent erosion (23andMe's pivot to drug development using customer genetic data), and accountability gaps (GPT models can't trace specific outputs to training inputs).

It's seductive to imagine that more data equals more insight. But insight isn't found in terabytes. It's found in context. And context is the home turf of small data.

Article contentTae-hyun Cho (right), the first Korean to be treated with assistance from Watson for Oncology, reviews his medical information with oncologists at Gachon University Gil Medical Center.


The Overlooked Power of Small Data

Small data doesn't shout. It whispers. It's the user feedback buried deep in a spreadsheet column. The outlier case that explains an entire design failure. It's not always scalable, but it's often the source of meaning. And yet, in product roadmaps and AI models, it's frequently sidelined.

Reflect on this: Phase III clinical trials typically involve 300-3,000 participants (with many Phase II trials using just 25-300 participants), the most cited medical papers often involve much smaller samples. The original study linking smoking to lung cancer by Doll and Hill? Just 649 doctors. The discovery of H. pylori causing ulcers (Nobel Prize, 2005)? Barry Marshall infected himself, an N-of-1 trial that revolutionized gastroenterology.

Research increasingly shows that larger datasets don't guarantee better real-world performance. A systematic review in npj Digital Medicine found that generative AI models achieved only 52.1% diagnostic accuracy overall, often performing similarly to non-expert physicians despite training on massive datasets. Meanwhile, studies consistently demonstrate that models incorporating qualitative insights from smaller, carefully curated samples often achieve better real-world generalization than those trained purely on volume.


The Truth of the Personal

Take psychiatry, a field that fundamentally defies standardization. In 1973, the famous "Rosenhan experiment" exposed how psychiatric diagnoses could be completely wrong despite following standard protocols. Fast-forward to today: the University of Utah runs over 200 N-of-1 trials annually, with 85% showing clinically meaningful results for individual patients, proving that personalized insights often matter more than population trends.

The story of psychiatric diagnosis itself tells this tale. Every version of the DSM was built from structured case observations, not machine learning algorithms. When developing the DSM-5, researchers conducted 1,500 expert reviews of individual case studies over 13 years, validating each diagnosis through careful small-sample clinical observations rather than big data analytics.

Meanwhile, at Harvard, Marc Brackett's team made a surprising discovery using self-reports from 45,000 students. While everyone expected anxiety to dominate classroom emotions, they found boredom was actually the primary feeling, a finding that reshaped educational psychology and couldn't have emerged from passive data collection alone.

Perhaps most tellingly, IDEO, the design consultancy behind the computer mouse and countless breakthrough products, generates their innovations through ethnographic interviews with just 8-12 users. Their philosophy captures the essence perfectly: "If you want to know what people think, ask them. If you want to know what they do, watch them."

Article contentInside IDEO


Ethics Needs Granularity

In my work certifying ethical AI systems, traceability and transparency are non-negotiable. The EU's AI Act requires "meaningful human oversight", impossible with black-box algorithms trained on billions of data points. With small data, you can trace each variable, document consent, and explain decisions.

Stanford's AI4ALL research found that models trained on datasets under 10,000 examples were 3x more likely to maintain performance across demographic groups compared to those trained on millions of examples. Why? Because small datasets forced researchers to carefully balance representation, while large datasets allowed biases to hide in statistical noise.

Then there's Cultural Sensitivity. A mental health app trained on U.S. data might fail completely in the Gulf region. Talkspace discovered this when their AI struggled with Arabic speakers, not due to language barriers, but cultural expressions of distress. Their solution? Local datasets of 500-1,000 culturally-specific interactions outperformed their global model trained on 10 million sessions.


Listening Before Scaling

We've become so enamored with scale that we've started skipping the step that gives data its purpose: listening. Toyota's legendary quality comes from "genchi genbutsu", going to see the actual place and situation. Each improvement starts with observing small samples, not analyzing all production data.

Our brains are naturally wired for small data processing. Daniel Kahneman's research shows humans can effectively process 7±2 pieces of information simultaneously. The "Dunbar number" suggests we maintain meaningful relationships with ~150 people. Evolution optimized us for small group dynamics, not big data comprehension.

Modern neuroscience reveals why small data resonates: mirror neurons fire when we hear individual stories but remain dormant during statistical presentations. This isn't a bug, it's a feature. Empathy scales with narrative, not numbers.


What Are the #Ramyfications, you say?

In the rush for buzzwords, small data is often ignored. But that's a strategic mistake with measurable consequences:

  • ROI Impact: Companies using hybrid big/small data approaches show 23% higher ROI on AI investments (McKinsey, 2024)

  • User Retention: Products incorporating qualitative insights retain users 31% longer (Firebase Analytics, 2024)

  • Clinical Outcomes: Healthcare AI systems using mixed methodologies show 18% better real-world performance (Nature Digital Medicine, 2024)

Big data gives you breadth. Small data gives you depth. Real insight lies at their intersection, where numbers meet narratives. For responsible AI and meaningful digital health, small data isn't optional, it's foundational.

Sometimes, the most transformative moment isn't in the dashboard. It's in the handwritten note. The single conversation. The quiet contradiction that upends an entire model.