A simple way to explain Big Data to anyone

Two things happened this morning. The first was that I chose to break a long-standing rule and signed up for the New York Times monthly subscription. Ugh, the hated paywall. As cord-cutters, we pay for very little of our news and entertainment, Netflix aside.

Which is how the second thing happened. I paid because they enticed me with a quiz that promised to identify where an American is from by correlating the way words are pronounced and used for common things. There were questions like, “How would you address a group of two or more people?” After I answered, “you guys,” it showed that I’m in the mainstream of the U.S. with the exception of the Old South. After each question, the next screen showed the usage patterns for that word or pronunciation.

Western New York!

I took the test based on the words I used growing up, not the words I use today, after living in a few countries and many states. A few questions later, it said I was likely from Western New York (a term I would never use back home), which was spot on. But even better, it told me both the broad pattern based on responses to all of the questions and the narrow pattern, a strong correlation to one answer: the word I grew up using for athletic shoes, “sneakers.”

It was an excellent example of key principles of Big Data analytics and here’s why:

Enough data to show a pattern

The quiz was developed using 350,000 responses. That’s not an enormous data set by Big Data standards, but it is enough data to predict an outcome very accurately, in this case where someone is from. Big Data doesn’t have to be massive to work. It only has to have enough data for useful patterns to be revealed. In this case, 350,000 variations (responses) of 25 data points (the questions) was clearly enough. Some questions require far larger data sets to show patterns, some smaller.

More data isn’t necessarily better

The 25 questions chosen for the quiz were culled from a larger number of questions in the original sample. These were the questions that produced the data that identified a person’s origins efficiently. More questions don’t create better answers because some data simply matters more, some much less. Each company exploring Big Data has to come to terms with what matters most for getting the best answer in the most appropriate time frame for taking action.

Data can be weak or strong

Not all data is created (or used) equal(ly). The fact that “sneakers” is what the NY Times called “most distinctive” for my husband’s vocabulary shows that more data doesn’t always mean better answers, as some data is simply stronger in meaning because it teases out the answer more quickly. How I answered the question about what to call a freight-moving truck (I said, “semi”) likely meant less in the outcome.
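One way to put a number on "weak" versus "strong" data is information gain: how much knowing the answer to one question reduces uncertainty about where someone is from. The sketch below is only an illustration, not how the NY Times built its quiz. The responses, region labels and question keys ("shoes", "truck") are all invented for the example.

```python
from collections import Counter
from math import log2

# Hypothetical toy responses: (region, answers). Invented for illustration only.
RESPONSES = [
    ("Northeast", {"shoes": "sneakers", "truck": "semi"}),
    ("Northeast", {"shoes": "sneakers", "truck": "tractor-trailer"}),
    ("Northeast", {"shoes": "sneakers", "truck": "semi"}),
    ("Midwest", {"shoes": "tennis shoes", "truck": "semi"}),
    ("Midwest", {"shoes": "tennis shoes", "truck": "semi"}),
    ("South", {"shoes": "tennis shoes", "truck": "semi"}),
    ("South", {"shoes": "tennis shoes", "truck": "tractor-trailer"}),
    ("South", {"shoes": "tennis shoes", "truck": "semi"}),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(responses, question):
    """How much one question's answer reduces uncertainty about the region."""
    regions = [region for region, _ in responses]
    by_answer = {}
    for region, answers in responses:
        by_answer.setdefault(answers[question], []).append(region)
    # Weighted average entropy of the regions left after seeing the answer.
    remaining = sum(
        len(group) / len(responses) * entropy(group) for group in by_answer.values()
    )
    return entropy(regions) - remaining


gains = {q: information_gain(RESPONSES, q) for q in ("shoes", "truck")}
for question, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{question}: {gain:.3f} bits")
```

In this made-up sample, the shoe question ranks well above the truck question, which mirrors the idea that a single strong answer can matter more than several weak ones.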

Seemingly unrelated data

Quiz questions crossed the boundaries of word use, colloquialisms and pronunciation, three concepts connected to geography but not necessarily related. If we only had data and didn’t know it represented linguistic patterns from a certain country, it might appear unrelated. Big Data technology is all about working backward from unrelated and unstructured data to find patterns that produce an output. If we had no map, we could still ascertain that certain people had a similar background from patterns in their language.
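The "no map" idea can be sketched in a few lines: group people purely by how similar their answers are, without storing any location at all. This is a minimal illustration under my own assumptions, not the method behind the quiz. The respondents, their answers and the 0.6 threshold are hypothetical, and the grouping is a simple link-if-similar pass rather than a production clustering algorithm.

```python
# Hypothetical respondents and answers, invented for illustration.
# No location is stored anywhere.
ANSWERS = {
    "r1": ("sneakers", "you guys", "soda"),
    "r2": ("sneakers", "you guys", "pop"),
    "r3": ("tennis shoes", "y'all", "coke"),
    "r4": ("tennis shoes", "y'all", "soda"),
    "r5": ("sneakers", "you guys", "soda"),
}


def agreement(a, b):
    """Fraction of questions on which two respondents gave the same answer."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def group_by_similarity(answers, threshold=0.6):
    """Put respondents in the same group when they agree with any member
    of that group on at least `threshold` of the questions."""
    groups = []
    for name, response in answers.items():
        matching = [
            g for g in groups
            if any(agreement(response, answers[member]) >= threshold for member in g)
        ]
        merged = [name]
        for g in matching:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return [sorted(g) for g in groups]


groups = group_by_similarity(ANSWERS)
print(groups)
```

With this toy data, the "sneakers / you guys" respondents end up in one group and the "tennis shoes / y'all" respondents in another, even though the code never sees a place name.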

Visualization is really cool

A picture truly is worth a thousand words. In a world where focused attention is in short supply, data visualization breaks through the noise. Data doesn’t have value by itself unless it has context for its meaning and use. Just as John Snow famously plotted cholera cases in London to trace the disease back to contaminated water, data can have meaning that remains hidden until it shows up in visual context (in this case, geography). In our NY Times quiz, we see that linguistics has dense clusters and areas of lower correlation corresponding to American westward migration patterns (see below).

Getting back to Big Data

While this demonstrates aspects of the analytics side of Big Data, it isn’t really a Big Data example. For one thing, Big Data is defined by O’Reilly analyst Edd Dumbill as “data that exceeds the processing capacity of conventional database systems. To gain value from this data, you must choose an alternative way to process it.” (Note: Gartner analyst Doug Laney was the first to define Big Data back in 2001, referring to it as data with high volume, velocity and variety.)

Secondly, as the quiz shows, data is nuanced, and more data is potentially even more nuanced. This means it takes some combination of business experts with powerful tools, data scientists (of whom there are few), and maybe machine learning applications like LIONsolver or Ayasdi. Getting your arms around Big Data is a big job.

Lastly, we can’t say for sure what data will be valuable at a future point in time or to untapped parts of the business. We don’t know what data seems siloed today that will join up with other data tomorrow to have real context. That means we have to always keep collecting, keeping in mind that more than Big Data, the challenge is all data.

[Image: Screen Shot 2013-12-25 at 5.22.02 PM]

4 Responses to “A simple way to explain Big Data to anyone”

  1. Mark Eastwood
    December 27, 2013 at 2:31 pm

    Christopher,
    Nice article, the FICO score that was invented in the 80s is another example of inspecting large data sets and finding those tidbits of data that are predictive (have informational value). What’s interesting about how the FICO score was developed is that it not only uses individual data elements, but unique combinations of data elements as “features” in the data. So for example the fact that you say both “sneakers” AND some other word could be a single derived fact that has higher “weight of evidence” than either word used alone.
    As I think you point out, the initial challenge is identifying those features that have value. Sometimes it depends on the experience of the SME and sometimes GAs and other means are interesting automated ways of finding these features in the data. Lastly, there is value to unstructured data, but it’s much more difficult to use.
    As you point out, the usefulness of this data/information is within some context. In the case of the FICO score, it’s the likelihood that someone will repay existing credit obligations within the next 24 months. Obtaining new credit (for example) is a different context. The same science applies, but perhaps different data features will prove to be valuable. Similarly with CPG sales data, the attractiveness of a particular product to a particular demographic is only part of the story. Each retail store has a customer base that either matches or doesn’t match the demographics that find a given product interesting. It’s the combination of the two that should lead to sales (the converse is also true).

    Cheers,
    Mark

    • Chris Taylor
      December 27, 2013 at 2:57 pm

      Great comments, Mark. You bring up a great concept that I didn’t cover…that context varies as well.

  2. Mary Molaskey
    December 30, 2013 at 8:10 pm

    Chris,

    I enjoyed your post. Always educational and witty.

    • Chris Taylor
      December 30, 2013 at 8:12 pm

      Thanks, Mary.
