Explaining Big Data – Part 1

by Steve Gailey 7. August 2015 09:11


Unstructured Data


Every day it seems I read an article or a post about Big Data. Many of these articles are written by knowledgeable people working at the Big Data coalface, producing the tools and developing the techniques the rest of us will come to rely upon and ultimately take for granted. An increasing number, though, seem to be written by people who, shall we say, have read a little about Big Data and have then jumped to some conclusions, or who perhaps have just filled in the blanks for themselves.

The problem is that as the rest of us try to get to grips with this new branch of data science we often struggle to separate one type of source from another. Much like the challenges we have with our own data really. What I plan to do in this series of short articles is explain some of the basics of Big Data, hopefully in very clear and simple terms. Rather than tackling the subject as a whole I will tackle the building blocks so hopefully by the time I’ve run out of ideas you should be equipped to make judgements on the more scholarly articles you read on the broader subject.

I thought that I would start with the much-misused topic of Unstructured Data. This is a term often bandied about when discussing Big Data, and it often leads to a lot of confusion and even argument.

The term unstructured data crept into the Big Data lexicon as we tried to distinguish Big Data approaches from those of relational databases. Some related terms you will hear are schema-on-write and schema-on-read (or schema-on-the-fly if you are a Splunk person), but I’m getting ahead of myself here…

So, let’s start with some controversy and let me say that there is no such thing as unstructured data! There, I’ve said it and I can’t take it back… All data has structure. That is the nature of data. Data without structure is called noise, and it isn’t really data. It is the structure within data that makes it (a) data and (b) usable. No structure and you have no information to convey. Come to think of it, I know some people like that…

So why do Big Data people get all excited about unstructured data then, if it doesn’t exist? What they are excited about relates very much to those two other related terms. In the Big Data world, unstructured data means not having to worry about the data’s structure until you need to ask the data a question, or analyse it. Compare that to the old world of relational databases, where you had to clearly identify the data’s structure in order to build a schema for it before you could ingest it into your database. Data that didn’t fit that schema simply wouldn’t go in. Data with very rich or complex structure thus didn’t fit well within a relational database, or would be stored as BLOBs (binary large objects) of unstructured data that couldn’t be analysed. This approach was called schema-on-write: the data had to be made to fit the schema at the time it was written into the data store.
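To make the schema-on-write idea concrete, here is a minimal sketch in Python. The field names and records are hypothetical, invented purely for illustration; the point is that the schema is checked at ingest time, and anything that doesn’t fit simply won’t go in.

```python
# Schema-on-write sketch: the schema is enforced when data is written,
# and records that don't fit it are rejected outright.
# All field names here are hypothetical, for illustration only.

SCHEMA = {"timestamp": str, "user": str, "bytes": int}

def write_record(store, record):
    """Insert a record only if it matches the fixed schema exactly."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"record does not fit schema: {record}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"bad type for {field!r}: {record[field]!r}")
    store.append(record)

store = []
write_record(store, {"timestamp": "2015-08-07T09:11", "user": "steve", "bytes": 512})

try:
    # Richer data with an extra field simply won't go in.
    write_record(store, {"timestamp": "2015-08-07T09:12", "user": "anna",
                         "bytes": 256, "user_agent": "curl/7.43"})
except ValueError as err:
    print("rejected:", err)
```

The second record carries extra structure the schema never anticipated, so it is thrown away rather than stored, which is exactly the data loss described above.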

One of the great things about the big data approach is that we have done away with schema-on-write and replaced it with schema-on-read. What is this then? I shall tell you…

In a schema-on-read system the data is stored in a very different way to the old relational database system. Normally the data is stored in its native form, though it may be chopped up into manageable, bite-sized pieces. An index is then written to work alongside the data to allow it to be queried. Normally every “word” is indexed, because we don’t know in advance what the data scientist is going to be interested in. Clearly the data and its index are going to be bigger than the original data source, so we tend to use compression to reduce the overall size. But this is where Big Data starts to get big: the old relational database approach meant that we often threw away a lot of data (the data which didn’t fit the schema), whereas now we are storing everything, plus an index to it as well.
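The storage side described above can be sketched as a simple inverted index. This is a toy Python illustration, not any particular product’s implementation, and the log lines are invented: each event is kept in its raw, native form, and every word is indexed so that any of it can be queried later.

```python
# Schema-on-read storage sketch: keep events raw, index every "word".
# The log lines are invented, for illustration only.
from collections import defaultdict

raw_events = [
    "2015-08-07 09:11 login user=steve status=ok",
    "2015-08-07 09:12 login user=anna status=fail",
    "2015-08-07 09:13 logout user=steve",
]

index = defaultdict(set)            # word -> positions of events containing it
for pos, event in enumerate(raw_events):
    for word in event.split():      # index every word: we don't yet know
        index[word].add(pos)        # which ones a future question will need

# Retrieve all raw events containing a given word.
hits = [raw_events[pos] for pos in sorted(index["user=steve"])]
print(hits)
```

Note that the index can easily grow larger than the data itself, which is why real systems lean heavily on compression, as mentioned above.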

So where does the schema-on-read bit come in, I hear you all ask? Well, when you need to query that data, or ask a question of it, you need to effectively apply your own schema to it. Or, put another way, you need to impose your own structure on the data, just for that one question. If you ask it a second question you may have to impose a completely different structure on the data in order to get your answer. And this is the beauty of schema-on-read. Your schema is transient. It lasts no longer than your query, and it does not limit your data in the way schema-on-write does, because it overlays your unstructured data rather than forcing your data to fit its shape. Think of it as a template. The data that fits gets analysed and the data that doesn’t gets ignored.
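The template idea can be sketched in a few lines of Python, using a regular expression as the transient “schema”. Again the events and patterns are invented for illustration: each question imposes its own structure on the same raw data, matching events are analysed, and non-matching events are simply ignored.

```python
# Schema-on-read query sketch: the "schema" is a pattern applied at
# query time, different for each question, and discarded afterwards.
# Events and patterns are invented, for illustration only.
import re

raw_events = [
    "2015-08-07 09:11 login user=steve status=ok",
    "2015-08-07 09:12 login user=anna status=fail",
    "kernel: eth0 link up",   # doesn't fit either question's template; ignored
]

# Question 1: which users failed to log in?
q1 = re.compile(r"user=(?P<user>\w+) status=fail")
failed = [m.group("user") for e in raw_events if (m := q1.search(e))]
print(failed)                 # ['anna']

# Question 2 imposes a completely different structure on the same data.
q2 = re.compile(r"(?P<time>\d\d:\d\d) (?P<action>login|logout)")
actions = [(m.group("time"), m.group("action"))
           for e in raw_events if (m := q2.search(e))]
print(actions)                # [('09:11', 'login'), ('09:12', 'login')]
```

Neither pattern changed the stored data; each one existed only for the duration of its query, which is the transience described above.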

So why should the rest of us get excited about this? The answer is simple: the ability to analyse unstructured data in this way gives us the ability to ask arbitrary questions of our data. In the old days we needed to know the questions, or at least the type of questions, we were going to ask before we even built our system. This didn’t lead to much in the way of flexibility, and it was one of the chief causes of the escalating costs of IT projects that constantly needed to be updated and amended. Now we can build our systems and worry about the questions later. And when the answer to one question leads to a host of new questions, our Big Data systems keep on answering those questions for us.

Your ability to analyse your data in a Big Data system depends more on your query languages or analytical tools than on the nature of the storage engine, then. And that is how it should be. Big Data empowers the people who own the data, or need answers from the data, in a way that relational database technology never did. It is redressing the balance and putting the power back in the hands of those who understand what the data can tell them. So when you hear “unstructured data” you know that actually this is your liberation – unless you are a relational database engineer, of course!
