This post originally appeared in the December issue of the Informed Librarian Online.
Data, data, data… everywhere data! It’s in blogs, social networks, websites, digital libraries, Twitter, Tumblr, Google, Amazon, your online library catalogue. It’s in this column. There’s a huge amount of data out there… one could almost describe it as big.
Big data is one of those buzzwords that isn’t going to disappear in the near future. Why? Because big data is only going to get bigger. According to IBM, “everyday, we create 2.5 quintillion bytes of data–so much that 90% of the data in the world today has been created in the last two years alone.” (1)
In this post I am going to provide you with an overview of:
- what big data is and where it comes from,
- what the big deal is all about, and
- most importantly, what it means for library and information professionals.
What is big data?
At the simplest level, big data is a term to describe a lot of data. Ingenious, isn’t it? Almost as informative as the definition of metadata – data about data.
So how do the experts describe big data?
“Big data, in general, is defined as high-volume, high-velocity, and high-variety assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” – Gartner Inc (2)
“‘Big data’ is the term for a collection of datasets so large and complex that it becomes difficult to process with traditional database tools or data processing applications” – Chris Sherman, Onlinesearcher.net (3)
“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” – Edd Dumbill, O’Reilly (4)
But what is it made of? The open source book Hadoop Illuminated provides some examples of what makes up big data:
- “Web Data: still it is big data
- Social media data: Sites like Facebook, Twitter, LinkedIn generate a large amount of data
- Click stream data: when users navigate a website, the clicks are logged for further analysis (like navigation patterns). Click stream data is important in on line advertising and E-Commerce
- Sensor data: sensors embedded in roads to monitor traffic and misc. other applications generate a large volume of data
- Connected Devices: Smart phones are a great example. For example when you use a navigation application like Google Maps or Waze, your phone sends pings back reporting its location and speed (this information is used for calculating traffic hotspots). Just imagine hundreds of millions (or even billions) of devices consuming data and generating data.” (5)
Whilst there is a massive amount of content on the internet, that isn’t what big data is about. Big data is the metadata behind the content. Thinking about the amount of content available on the internet is mind-boggling. However, the amount of metadata out there is that turned up to eleven.
Just to give you an idea of the massive amount of big data out there: back in 2010 the company DataSet released an example of the amount of data available from a single tweet. Every single tweet carries thirty fields of data. A picture of all the fields is available here: http://readwrite.com/2011/11/16/what_a_tweet_can_tell_you
If you think about how many tweets are posted every second of every day, you can start to appreciate just how big the data is that we are talking about.
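To make the content-versus-metadata point concrete, here is a minimal Python sketch. The record below is hypothetical and heavily simplified: the field names are illustrative assumptions, not the real fields of a tweet, and there are far fewer than thirty of them. It simply counts how many fields describe the tweet compared with the one field that holds what the author actually wrote.

```python
import json

# Hypothetical, simplified tweet-like record. The field names are
# assumptions made for illustration, not the actual Twitter data model.
tweet = {
    "text": "Big data is big!",
    "id": 1234567890,
    "created_at": "2010-04-01T12:00:00Z",
    "lang": "en",
    "source": "web",
    "user": {
        "screen_name": "example_librarian",
        "location": "Sydney",
        "followers_count": 250,
        "time_zone": "Sydney",
    },
    "geo": None,
    "retweet_count": 0,
}


def count_leaf_fields(record):
    """Count every non-dict value, descending into nested dicts."""
    total = 0
    for value in record.values():
        if isinstance(value, dict):
            total += count_leaf_fields(value)
        else:
            total += 1
    return total


content_fields = 1  # only "text" is what the author actually wrote
metadata_fields = count_leaf_fields(tweet) - content_fields

content_bytes = len(tweet["text"].encode("utf-8"))
total_bytes = len(json.dumps(tweet).encode("utf-8"))

print(f"content fields: {content_fields}, metadata fields: {metadata_fields}")
print(f"content is {content_bytes} of {total_bytes} bytes in the record")
```

Even in this cut-down example the metadata outweighs the content, and that is before multiplying by every tweet posted in a day.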
What’s the point of it?
So what’s the big deal? The big deal about big data isn’t just the fact that there is a lot of data out there in the world. Rather, what big data is focused upon is the opportunity that machine-created data presents as an information resource, and our ‘new found’ capability to harness it.
Due to technological development we now have the capability to make sense of the supercalifragilisticexpialidocious-ly massive amounts of data that are being produced every day. Currently, access to unique big data as a mechanism for understanding your users is limited to those with very VERY big budgets. However, as we have seen with many other technologies, this expense is not going to persist forever and the technology is only going to become more accessible, meaning cheaper.
The current expense of big data is associated with the task of collecting and sorting the data, not with the results of its analysis. The excitement surrounding big data is predominantly associated with the human analytical factor.
Once data is collected and sorted, the true value of big data comes from the ability to recognise valuable information and patterns within the data and identify the action necessary to meet the need. The best mindset with which to approach the data sets is an open one. Rather than seeking an answer to a narrowly targeted problem, come to big data with a general idea or intent, and stay open to the unexpected: the real opportunity of big data is allowing the data to identify the need for you.
Not going to be able to get approval for big data projects in the near future? Well, you can still play in the world of big data using public datasets. I do feel I should warn you now: unfortunately, it’s not like you are in The Matrix. Most public data sets look more like a spreadsheet or graph, but they show the type of information that all the fuss is made about.
Google has a range of data sets available via a public directory (http://www.google.com/publicdata/directory). Big data is big picture – you can look at each country’s technological readiness and use of ICTs with the Global Competitiveness Report, or compare countries’ forecast population growth with the International Monetary Fund data set.
Hadoop Illuminated includes a list of public big data sets to give you an idea of the type of information available and what working with big data looks like. These are available here: http://hadoopilluminated.com/hadoop_book/Public_Bigdata_Sets.html
What does it mean for library and information professionals?
Big data means opportunity. As a profession our skill sets marry very neatly with those required for making the most out of big data.
Firstly, we have the database and metadata skills to ensure that the big data collected is valuable. With big data there is a risk that the data collected adds no value. To prevent the big data process from bringing in rubbish, there needs to be an individual with the skills to identify and collect valuable data – a library and information professional. We can design the framework (metadata) to facilitate the collection of valuable data.
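As a rough illustration of what such a framework might look like in practice, here is a minimal Python sketch. Everything in it is an assumption for the sake of the example: the field names, the sample usage records and the rule that a record is only worth keeping when every required field is present. A real project would agree its own metadata schema with the people who need the data.

```python
# Hypothetical metadata framework: the fields a usage record must carry
# before it is considered worth collecting. Names are illustrative only.
REQUIRED_FIELDS = {
    "timestamp": str,
    "resource_id": str,
    "action": str,
    "user_group": str,
}


def missing_fields(record):
    """Return the required fields that are absent or of the wrong type."""
    return sorted(
        name
        for name, expected_type in REQUIRED_FIELDS.items()
        if not isinstance(record.get(name), expected_type)
    )


def is_valuable(record):
    """A record is kept only when every required field is usable."""
    return not missing_fields(record)


# Made-up sample records, not real usage data.
records = [
    {"timestamp": "2014-12-01T09:15:00", "resource_id": "ebook-001",
     "action": "download", "user_group": "undergraduate"},
    {"timestamp": "2014-12-01T09:16:00", "action": "search"},
    {"timestamp": None, "resource_id": "ebook-002",
     "action": "view", "user_group": "staff"},
]

kept = [r for r in records if is_valuable(r)]
rejected = [(r, missing_fields(r)) for r in records if not is_valuable(r)]

print(f"kept {len(kept)} of {len(records)} records")
for record, gaps in rejected:
    print("rejected, missing:", ", ".join(gaps))
```

The point of the sketch is the design choice rather than the code: deciding up front which fields matter is what stops the collection process from filling up with rubbish.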
Secondly, the analytical mindset embedded within the profession is the best weapon for tackling large amounts of information. With a user focus, we can identify opportunities, gaps and needs within the information collected.
- Big data is here to stay and is only going to become more and more accessible
- The value of big data is dependent upon the metadata behind it
- Keep a look out for the opportunity to embed your skill sets in a big data project
- With big data the opportunity is there – you just need to make something of it
(1) IBM, What is big data? URL: http://www-01.ibm.com/software/au/data/bigdata/
(2) From The Big Data Explosion: Maximizing information value while minimizing risk (2013) Information Management, Volume 42 Issue 2, Proquest Library Science, p s2
(3) Chris Sherman, Online Searcher 38.2 (Mar/Apr 2014): 10-16.
(4) Edd Dumbill, What is big data? URL: http://radar.oreilly.com/2012/01/what-is-big-data.html