Explaining Big Data – Part 1

by Steve Gailey 7. August 2015 09:11


Unstructured Data


Every day it seems I read an article or a post about Big Data. Many of these articles are written by knowledgeable people working at the Big Data coalface, producing the tools and developing the techniques the rest of us will come to rely upon and ultimately take for granted. An increasing number, though, seem to be written by people who, shall we say, have read a little about Big Data and have then jumped to some conclusions, or who perhaps have just filled in the blanks for themselves.

The problem is that as the rest of us try to get to grips with this new branch of data science we often struggle to separate one type of source from another. Much like the challenges we have with our own data really. What I plan to do in this series of short articles is explain some of the basics of Big Data, hopefully in very clear and simple terms. Rather than tackling the subject as a whole I will tackle the building blocks so hopefully by the time I’ve run out of ideas you should be equipped to make judgements on the more scholarly articles you read on the broader subject.

I thought that I would start with the much-misused topic of unstructured data. This is a term often bandied about when discussing Big Data, and it often leads to a lot of confusion and even argument.

The term unstructured data crept into the Big Data lexicon as we tried to distinguish Big Data approaches from those of relational databases. Some related terms you will hear are schema-on-write and schema-on-read (or schema-on-the-fly if you are a Splunk person), but I’m getting ahead of myself here…

So, let’s start with some controversy and let me say that there is no such thing as unstructured data! There, I’ve said it and I can’t take it back… All data has structure; that is the nature of data. Data without structure is called noise, and it isn’t really data. It is the structure within data that makes it (a) data and (b) usable. No structure and you have no information to convey. Come to think of it, I know some people like that…

So why do Big Data people get all excited about unstructured data then, if it doesn’t exist? What they are excited about relates very much to those two other related terms. In the Big Data world, unstructured data means not having to worry about the data’s structure until you need to ask the data a question, or analyse it. Compare that to the old world of relational databases, where you had to clearly identify the data’s structure in order to build a schema for it before you could ingest it into your database. Data that didn’t fit that schema simply wouldn’t go in. Data with very rich or complex structure thus didn’t fit well within a relational database, or would be stored as BLOBs of unstructured data that couldn’t be analysed. This approach was called schema-on-write: the data had to be made to fit the schema at the time it was written into the data store.
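To make the contrast concrete, here is a toy sketch of schema-on-write in Python. The field names and the `ingest` function are my own invention for illustration, not any real database API: the point is simply that the schema is fixed before any data arrives, and a record that doesn’t fit is rejected at ingest time.

```python
# A toy schema-on-write store: the schema is fixed up front,
# and every record is validated at the moment it is written.
SCHEMA = {"user", "action", "status"}

def ingest(record, table):
    """Write a record into the table, but only if it fits the schema exactly."""
    if set(record) != SCHEMA:
        raise ValueError(f"record does not fit schema: {sorted(record)}")
    table.append(record)

table = []
ingest({"user": "alice", "action": "login", "status": "ok"}, table)

# A richer record, with an extra field the schema never anticipated,
# is refused at write time, and that extra data is effectively thrown away:
# ingest({"user": "bob", "action": "login", "ip": "10.0.0.1", "status": "ok"}, table)
```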

One of the great things about the big data approach is that we have done away with schema-on-write and replaced it with schema-on-read. What is this then? I shall tell you…

In a schema-on-read system the data is stored in a very different way to the old relational database system. Normally the data is stored in its native form, though it may be chopped up into manageable, bite-sized pieces. An index is then written to work alongside the data to allow it to be queried. Normally every “word” is indexed, because we don’t know in advance what the data scientist is going to be interested in. Clearly the data and its index are going to be bigger than the original data source, so we tend to use compression to reduce the size of it all. But this is where big data starts to get big: the old relational database approach meant that we often threw away a lot of data (which didn’t fit the schema), whereas now we are storing everything, and an index to it as well.
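As a rough illustration of the “index every word” idea (a toy sketch of an inverted index, not how any particular engine works internally, with made-up log events):

```python
from collections import defaultdict

# Raw events, stored in their native form; the index is built alongside them.
events = [
    "2015-08-07 09:11:02 user=alice action=login status=ok",
    "2015-08-07 09:12:45 user=bob action=login status=fail",
    "2015-08-07 09:13:10 user=alice action=logout status=ok",
]

# Index every "word" against the events it appears in, because we don't
# know in advance which terms an analyst will search for.
index = defaultdict(set)
for i, event in enumerate(events):
    for word in event.split():
        index[word].add(i)

# Any term can now be looked up without a predefined schema.
matches = [events[i] for i in sorted(index["user=alice"])]
```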

So where does the schema-on-read bit come in, I hear you all ask? Well, when you need to query that data, or ask a question of it, you need to effectively apply your own schema to it. Or, put another way, you need to impose your own structure on the data, just for that one question. If you ask it a second question you may have to impose a completely different structure on the data in order to get your answer. And this is the beauty of schema-on-read. Your schema is transient. It lasts no longer than your query, and it does not limit your data in the same way as schema-on-write, because it overlays your unstructured data rather than forcing your data to fit its shape. Think of it as a template. The data that fits gets analysed and the data that doesn’t gets ignored.
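Continuing the toy sketch (Python 3.8+, with made-up events and field names): each question imposes its own transient structure on the same raw events, here via a regular expression that lasts no longer than the query itself.

```python
import re

# The same raw events, untouched since ingest.
events = [
    "2015-08-07 09:11:02 user=alice action=login status=ok",
    "2015-08-07 09:12:45 user=bob action=login status=fail",
    "2015-08-07 09:13:10 user=alice action=logout status=ok",
]

# Question 1 imposes one structure: who did what?
q1 = re.compile(r"user=(?P<user>\w+) action=(?P<action>\w+)")
who_did_what = [m.groupdict() for e in events if (m := q1.search(e))]

# Question 2 imposes a completely different structure: when did things fail?
q2 = re.compile(r"^(?P<ts>\S+ \S+).*status=fail")
failures = [m.group("ts") for e in events if (m := q2.search(e))]
```

Events that don’t match a query’s pattern are simply ignored by that query, exactly like the template described above.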

So why should the rest of us get excited about this? The answer is simple: the ability to analyse unstructured data in this way gives us the ability to ask arbitrary questions of our data. In the old days we needed to know the questions, or at least the type of questions we needed to ask, before we even built our system. This didn’t lead to much in the way of flexibility, and it was one of the chief causes of the escalating costs of IT projects that constantly needed to be updated and amended. Now we can build our systems and worry about the questions later. And when the answer to one question leads to a host of new questions, our Big Data systems keep on answering those questions for us.

Your ability to analyse your data in a Big Data system depends more on your query languages or analytical tools than on the nature of the storage engine, then. And that is how it should be. Big data empowers the people who own the data or need answers from the data in a way that relational database technology never did. It is redressing the balance and putting the power back in the hands of those who understand what the data can tell them. So when you hear “unstructured data” you know that actually this is your liberation – unless you are a relational database engineer, of course!

My mistake

by Steve Gailey 8. July 2015 16:27

I thought I'd found a good airline...

So it turns out that all airlines are as bad as each other. After years as a very (some would say overly) loyal BA customer I decided to change. The new BA executive club seems to be aimed at someone other than me and suddenly little perks like being able to book a specific seat have disappeared even though I’m a silver card holder. So, having taken advice I decided to switch my allegiance to KLM and make Amsterdam my hub airport.

Well, what a bad idea that turned out to be. My very first trip and I’m flying back into London City Airport, and I find myself at Amsterdam somewhat early, so I pay my €95 for an earlier flight and head through to the gate. If you know Schiphol airport, you know it can be quite a hike to the gates, so it took me about 40 minutes to get through security, passport control and all the way to gate 30. By the time I get there, the flight has been cancelled! I’m told that there weren’t enough people to make the flight worthwhile and that I must book on another flight. So, another couple of long walks later, I have done that and am back at the gate for a long wait for the original flight I was booked on.

Now I’m sure that you are all thinking – no problem, KLM sold you a product which they then didn’t deliver, so they will at least refund you the €95 you spent transferring to an earlier flight… Well no, actually. The earlier flight sold to me was operated by CityJet, not KLM, so KLM advised me to take it up with CityJet. CityJet, on the other hand, say it is nothing to do with them and that I should go back to KLM, who tell me that I should speak to CityJet.

Now I’m no expert in civil aviation but this seems to be somewhat crap! I’m the poor guy who just wants to get home to be with his family after a long and tiring business trip and these two organisations who apparently know each other well enough to operate flights for one another can’t even seem to have a conversation when something goes wrong. I’m now looking back at BA and whilst I still think they are an organisation that doesn’t care much about their customers (with the exception of in the air where the BA people are excellent) I suspect that BA are far from alone in their general crapness. It would appear that KLM and CityJet are equally crap, and don’t even get me started on Air France who delivered me not only to the wrong Italian city but the wrong region…

Other organisations don’t seem to treat their customers like this, why is it just airlines?

Fixing British Politics

by Steve Gailey 8. December 2014 10:50


With a general election looming in the United Kingdom I thought that I would share with you two very simple ideas to improve the quality of government we have to endure. This is nothing to do with the fact that it now seems likely that the balance of power will be held by the SNP, who will ignore the decision of the Scottish people and drive through independence for Scotland as their price for propping up a minority Labour government. That is speculation on my part at this stage, but let’s see… No, this is something far more general which I think would improve the quality of all politicians of every party. Something to be lauded, I’m sure you will agree?

So the first idea is very simple: ensure that politicians are actually qualified to do their jobs and understand the running of government. I would make it a requirement for anyone who joins the executive branch of government – that is, anyone who is a member of the cabinet – to pass the Civil Service exam before taking up their post. MPs should be encouraged to take the exam as soon as they become MPs, or even when they are just parliamentary candidates, to ensure that they are ready in advance. Taking and passing this exam, which is required of all those administrators who ensure that the country runs, seems like an obvious requirement for members of parliament who wish to make changes to how things run. You only need to look at some of the failed policies of the past to see that a little education might have saved the country money, time and a great deal of pain.

The second idea may be a little more controversial, but here goes… All parliamentary candidates, that is anyone wishing to stand as an MP, should first have completed no less than three years’ service in either one of the emergency services, a branch of the military or the medical profession, at least as a nurse. MPs are forever telling us that being an MP is service but, let’s face facts, it’s not. Being a nurse, a policeman, a fireman or a soldier is service. Given that we are all being asked to work until we are about 70, I think three years of real service before you can be a politician is actually very little to ask, and it would certainly ensure that they have a little more understanding of real people and their lives. We hear about compassion in politics nowadays, but I have seen very little of it. Let’s ensure that our politicians have a chance to really understand this before they get to govern us.

These two ideas could easily be phased in over the life of two or three parliaments without disrupting the business of running Britain. We could even decide to apply them only to new parliamentary candidates, to allow some of the old beasts who frequent the House of Commons to continue. It would be interesting to watch the change that rules such as these might make to compassionate politics over time. Unlike most laws, they are very unlikely to do any harm.

Let me know what you think of this – I want to hear on LinkedIn if you like these ideas, have any similar ideas of your own or if you see problems I haven’t considered…

Let’s make Britain a better place, regardless of which political party happens to be running it at the time!



by Steve Gailey 8. November 2014 10:07



Everyone accepts that the concept of community, at least in the developed world, has suffered as technology has developed. Certainly I remember as a small boy constantly being in the neighbours’ houses, my parents frequently chatting with them over the fence and our front door always open, with people popping in and out seemingly all day long. Everyone knew everything about everyone. If someone had a problem then people would rally around to help. There were very few secrets, of course; that level of familiarity makes it almost impossible, but the upsides largely outweighed the downsides.

This type of close and supportive local community, whilst not gone from our world is certainly less common now. Imagine then how excited I was to realise that it is still alive and well and expanding in cyber-space. Far from everything online being bad, the level of community in some areas mirrors the old style local communities I remember from my childhood. Oh it is slightly different now. Having someone’s children access your computer is somehow worse than having them access your house…

Last week I was enjoying a staycation (a week’s holiday at home) and decided to do some work on my Land Rover. I had bought all the parts to fit a second battery (to power auxiliary equipment) and a split charging system, to ensure that the alternator charged both batteries but that they were separated with the engine off, so that I couldn’t flatten my starting battery. The installation itself was pretty simple, except for the fact that the two batteries didn’t initially fit in the compartment together, so I had to turn them sideways and then fashion a new mounting bracket on my forge (doesn’t everyone have a home forge?) to hold them in place. After that I wired in the split charging system and took her for a spin.

To my frustration, the 180A relay linked and unlinked the two batteries continuously on a 5-second cycle. I checked my wiring and everything seemed to be as described, so I sought out a community to help me. It didn’t take long to find a couple of online forums for Land Rover drivers, and after joining I posted my questions. Within minutes I had help and suggestions about possible problems from people who had done the same thing. Armed with my new-found information I soon identified the problem – a loose connection in the battery charger controller – and everything was as it should be.

Had my problem been with something else, I’m certain that I could have found the appropriate community to reach out to for help. The community may no longer be made up of my close friends and neighbours, but it can now span the world, and every member is an expert in whatever my problem might be that day. There is still no privacy – anyone who wishes can find me on LinkedIn, Facebook or Twitter – so not much has really changed there.

The spirit of community is strong in the human psyche; it didn’t die, it simply moved online and moved with the times. Now please excuse me, as I have some questions to answer on some online forums.


Splunking Email

by Steve Gailey 12. September 2013 17:14


One of the things people have asked me about this topic is “why” – why Splunk your emails in the first place? I thought I’d share a fuzzy screenshot showing a small dashboard I put together to let me look at about seven years of emails. I can do interesting but not particularly useful things like see at what time I get emails from people in New York, or from a particular group of people, and at what time I send to those people.

Slightly more usefully, I can see who emails me most and on what subject, and similarly who I am sending emails to and on what subject. The most useful thing remains being able to find that very elusive email that you know you sent (or were sent). Splunk always was the tool to find that needle in a haystack, and seven years’ worth of emails represents a pretty big haystack! Naturally you can drill down into the detail you need in this dashboard.

Happy Splunking


It has seemed like a fairly good idea to me for a while to Splunk my email. How good would it be to be able to analyse who was sending you emails about audit, or to immediately find that email you were sent last year about bonuses that had an Excel attachment? Better yet, how about analysing who you are communicating with most, and about what… It sounds hard to do, but that is exactly what Splunk is good at. Some pretty big organisations are doing it with their entire company email feed, using the Splunk App for Microsoft Exchange to make sense of it all, but I needed something a little more pedestrian. Now the tricky bit: how do you get all your personal emails into Splunk? Well, I obsessively save emails, so I have lots and lots of PST files full of emails that seemed important at the time. Outlook is pretty poor at processing these large files. Try having a few big ones open from a network share if you don’t believe me…

The trouble is that Splunk can’t interpret PST files. To be honest Microsoft struggles so what chance do the rest of us have? Well now that I don’t work seventeen hours a day I find myself at a loose end at the weekends so I thought I’d tackle this problem. I had already written an application to read PST files (see my previous post) and search them at high speed in memory so I had the beginnings of what I needed. I just needed to do the same sort of thing but to write the headers out to a CSV file for Splunk to process.

Having looked at my old PST files I realised that I had 9 of them, totalling about 13GB of old data. That answered the next question about requirements: I needed to process several files at once, which meant the program needed to be multi-threaded. Actually it all proved remarkably simple to put together, and in one afternoon I had a working utility which allowed me to select any number of PST files and which then loaded and worked on them in parallel, reading each message in turn and writing the headers out to a file. Fortunately my home-built machine is a six-core AMD box, which this little application kept pretty busy when I ran it for the first time.
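The parallel layer can be sketched with a thread pool; `process_file` here is a hypothetical stand-in for the real per-file extraction routine, and the file names are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # Stand-in for the real work: read every message in `path`
    # and write its headers out to a CSV file.
    return f"{path}: done"

# Nine archives, as in the post; one worker per file so they are
# all read concurrently rather than one after another.
paths = [f"archive{i}.pst" for i in range(9)]

with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    results = list(pool.map(process_file, paths))
```

Threads suit this sketch because the real routine spends most of its time on file I/O rather than computation.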

I wrote it to update the user on progress, as sitting waiting for a file to appear is not the interactive experience I desire. Having created the 9 CSV files I loaded them into my home Splunk server (what do you mean, you don’t have one?) and in seconds they were indexed and I could start querying. I have to say that it was worth the small effort involved. I can very quickly search by subject, sender, recipient, CC, attachments or anything else that is in the header. I can easily graph out the trends and relationships, but best of all I can easily find that elusive email I knew that I had received, and having identified it I can go back to the original PST and extract the attachment or read the email. I would put wonderful Splunk screenshots up, but I don’t really want you to see my emails, so I suggest you try it for yourself.

PSTHeaders.zip (3.49 mb) Here is the application to extract your PST headers. Let me know if you have any issues with it. Just unzip and run setup to install the application.

Happy Splunking.


