Big Data is all about knowing what questions to ask

September 25, 2013 by Chris Skinner

Like everyone, I was shocked by the news of
the Westgate Shopping Mall shootings in Nairobi, Kenya.

The real shock is how they determined who to shoot, singling
out individuals and asking whether they could name the Prophet Muhammad’s
mother:

Reports from separate
floors of the building in the first hours of the assault told how the
attackers, speaking rough Swahili and English, shouted at Muslims to identify
themselves. Many people came forward. They were ordered to speak in Arabic, or
to recite a verse from the Koran, or to name the Prophet Mohammed’s mother.
Those who passed this test were allowed to flee. Those that did not were
executed, including children.

It almost seems perfunctory to relate this to banking, but
it did sit firmly in my mind as I chaired a meeting around Big Data last night.

I hate the term Big Data, as mentioned before, and feel it needs a context, so here it is.

I don’t know the name of the Prophet’s mother but, within
two seconds, I can Google the answer: Aminah
bint Wahb.

I don’t know a verse of the Koran but can find one in
seconds online: Assalamu alaikum wa rahmatullahi
wa barakatuh (May the peace, mercy, and blessings of Allah be with you).

And there is the context of Big Data: if you don’t know the
question, how can you find the answer?

The discussion about Big Data last night was in the context
of fraud and Anti-Money Laundering (AML), and was a wide-ranging conversation.

Big Data for fraud and AML is all about cost avoidance, whilst
much of the wider Big Data conversation is about marketing and
sales for revenue uptick.

Both are valid uses of Big Data analytics, but this market
is nothing new.

Teradata was doing all this stuff in the 1990s with propensity
modelling and data mining. Wal*Mart was their biggest customer in the world back
then, with a 27 terabyte database.

The change today is that the world produces 27 terabytes every
few seconds thanks to social media.

This is well illustrated by Maria Conner’s recent blog entry:

In 2012, 2.5 quintillion bytes of data (a quintillion is 1
followed by 18 zeros) are created every day, with 90% of the world’s data created in
the last two years alone. As a society, we’re producing and capturing more data
each day than was seen by everyone since the beginning of the earth.

This vast amount of digital data would fill a stack of DVDs
reaching from the Earth to the moon and back. To put things in perspective, the entire works of
William Shakespeare (in text form) represent about 5 MB of data. So,
you could store about 1,000 copies of Shakespeare on a single DVD. The text in
all the books in the Library of Congress would fit comfortably on a stack of
DVDs the height of a single-story house.

The world’s technological per-capita capacity to store
information has roughly doubled every 40 months since the 1980s according to Martin
Hilbert and Priscila López.

Unstructured data accounts for 80% of the data in
the world, and we know much of that comes from social media, so that gets special
attention.

How much data is generated through social media tools?

People send more than 144.8 billion email messages a day.
People and brands on Twitter send more than 340 million tweets a day.
People on Facebook share more than 684,000 bits of content a day.
People upload 72 hours (259,200 seconds) of new video to YouTube a minute.
Consumers spend $272,000 on Web shopping a day.
Google receives over 2 million search queries a minute.
Apple receives around 47,000 app downloads a minute.
Brands receive more than 34,000 Facebook ‘likes’ a minute.
Tumblr blog owners publish 27,000 new posts a minute.
Instagram photographers share 3,600 new photos a minute.
Flickr photographers upload 3,125 new photos a minute.
People perform over 2,000 Foursquare check-ins a minute.
Individuals and organizations launch 571 new websites a minute.
WordPress bloggers publish close to 350 new blog posts a minute.
The Mobile Web receives 217 new participants a minute.

The most up-to-date
numbers are available from the sites themselves.

So what?

Well, the answer to the ‘so what’
test is that twenty years ago we could not produce, search, analyse and track so
much data because it was too costly.

Teradata used to refer to their systems as BFOBs (Big F-Off
Boxes) and said it would take a $20 million plus investment to get one up and running
effectively. Today, you can do that
analysis in the cloud for peanuts.

This means that we couldn’t analyse and leverage the data in
the past, but we can today. The question then is how do you do it?

Bring all the data into one big enterprise bucket, and then
apply Hadoop to it?

Possibly, but that does not work in many banks as they still have everything
structured in siloed boxes, some of which are segregated by law. For example, integrating the insurance data
with the banking data in a bancassurance group is still claimed to be a big
no-no.

That does not wash today, however, and I suspect that
regulations are used as an excuse for inertia rather than being a real block. After all, Tesco Bank claim this will be their major opportunity:

“In our move from retailing products to bank retailing,
it amazes me that the current incumbents reward the new customer rather than
the existing one. That encourages promiscuity and commoditisation. If you can
reward the existing customer more than the new one, by learning more about
them, then you can price your products better. For example, our Clubcard (their
major loyalty program) data allows us to price our products 15% more accurately
than the Royal Bank of Scotland for any particular risk type by customer
segment. This means we can be the best at risk-adjusted pricing.”

In other words,
integrating retail data with financial data is not a big leap of thinking. It needs to be on a permissions basis however,
as I would not appreciate you offering me baby products if I didn’t know my
partner was pregnant or, worse, my daughter.

Target started sending
coupons for baby items to customers according to their pregnancy scores, resulting
in an angry man going into a Target store in Minneapolis and demanding to talk to
the manager.

“My daughter got this in the mail!” he said.
“She’s still in high school, and you’re sending her coupons for baby clothes
and cribs? Are you trying to encourage her to get pregnant?”

The manager didn’t
have any idea what the man was talking about and apologised. He called a few days later to apologise again
but the father was somewhat abashed. “I had a talk with my daughter,” he said.
“It turns out there’s been some activities in my house I haven’t been
completely aware of. She’s due in August. I owe you an apology.”

Data analytics is the new battleground and the first step is
to get the data sorted for the purpose of the question you are trying to
answer.

Then there is another interesting aside: it’s not just the
internal data.

As we talked last night, many of the attendees felt the
hardest part would be organising the data internally, with Forrester saying that companies only use around 12% of the internal data available to
them.

But what about all the external data? When people leave digital footprints built
over years in Facebook, LinkedIn, Twitter, Tumblr, Flickr and more, it
becomes far easier than ever before to track individuals’ histories and identify
them.

That’s what criminals are finding, as referenced in the recent
report by Sophos, who cracked open a criminal gang using malware in Russia
thanks to the gang’s social media footprints.
So shouldn’t we be using these footprints to find the criminals who launder or defraud?

It’s obviously not a simple thing, however, as building data
banks that hold all the data about an individual in the public domain and internally
would be a massive task … but today’s technologies allow you to tackle such
massive tasks. As mentioned, you can do what
Wal*Mart were doing twenty years ago for a few pennies today.

I guess the conclusion is that if data is the battleground,
then you need to arm yourselves with as much weaponry as possible and, for
those who invest the most in their warfare, the rewards will be increased
market share and decreased cost.

That’s as long as you know the question to be asked, of
course.

“It’s not just about
looking for needles in haystacks, but removing some of the hay.”
Martha Bennett, Forrester


Chris M Skinner

Chris Skinner is best known as an independent commentator on the financial markets through his blog, TheFinanser.com, as author of the bestselling book Digital Bank, and Chair of the European networking forum the Financial Services Club. He has been voted one of the most influential people in banking by The Financial Brand (as well as one of the best blogs), a FinTech Titan (Next Bank), one of the Fintech Leaders you need to follow (City AM, Deluxe and Jax Finance), as well as one of the Top 40 most influential people in financial technology by the Wall Street Journal's Financial News.
