Structured, semi-structured and unstructured data

in #bigdata, 7 years ago

Three concepts come up with big data: structured, semi-structured and unstructured data.

Structured Data

For geeks and developers (not the same thing ^^), structured data is very commonplace. It covers all data that can be stored in a SQL database, in tables with rows and columns. Such data has relational keys and can be easily mapped into pre-designed fields. Today it is the most commonly processed kind of data in development and the simplest way to manage information. But structured data represents only 5 to 10% of all data. So let’s introduce semi-structured data.
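To make "tables with rows and columns" and "relational keys" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and values are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)

# Every row fits the pre-designed fields of its table.
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 19.5), (1, 5.0)],
)

# The relational key (orders.customer_id -> customers.id) lets us join the tables.
rows = conn.execute(
    "SELECT c.name, COUNT(o.id), SUM(o.amount) "
    "FROM customers c JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.id"
).fetchall()
print(rows)
```

Because the schema is fixed up front, questions like "how much did each customer order?" become a single query.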

Semi-structured data

Semi-structured data is information that doesn’t reside in a relational database but does have some organizational properties that make it easier to analyze. With some processing you can store it in a relational database (which can be very hard for some kinds of semi-structured data), but the semi-structured form exists to save space, improve clarity or ease computation. Examples of semi-structured data: CSV files, but also XML and JSON documents; NoSQL databases are considered semi-structured as well. Like structured data, semi-structured data represents only a small part of all data (5 to 10%), so the last data type is the dominant one: unstructured data.
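As a rough illustration of the "some processing" needed to fit semi-structured data into a relational shape, the sketch below flattens a small JSON document into rows with a shared set of columns. The sample records and the `flatten` helper are hypothetical; joining lists into a comma-separated string is just one simple choice among several.

```python
import json

# Hypothetical JSON input: records share some fields but not all of them.
raw = """
[
  {"id": 1, "name": "Alice", "tags": ["vip", "beta"]},
  {"id": 2, "name": "Bob", "address": {"city": "Paris"}}
]
"""

def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names; lists become joined strings."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        elif isinstance(value, list):
            flat[column] = ",".join(str(v) for v in value)
        else:
            flat[column] = value
    return flat

records = [flatten(r) for r in json.loads(raw)]

# The union of all keys becomes the column list; missing values become None (NULL).
columns = sorted({c for r in records for c in r})
table = [[r.get(c) for c in columns] for r in records]

print(columns)
for row in table:
    print(row)
```

The empty cells in the resulting table show why this mapping can get hard: the more the records differ from each other, the sparser and wider the relational version becomes.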

Unstructured data

Unstructured data represents around 80% of data. It often includes text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn’t fit neatly in a database.

Unstructured data is everywhere. In fact, most individuals and organizations conduct their lives around unstructured data. Just as with structured data, unstructured data is either machine generated or human generated. Here are some examples of machine-generated unstructured data:

  • Satellite images: This includes weather  data or the data that the government captures in its satellite  surveillance imagery. Just think about Google Earth, and you get the  picture. 
  • Scientific data: This includes seismic imagery, atmospheric data, and high energy physics. 
  • Photographs and video: This includes security, surveillance, and traffic video. 
  • Radar or sonar data: This includes vehicular, meteorological, and oceanographic seismic profiles. 

The following list shows a few examples of human-generated unstructured data:  

  • Text internal to your company: Think of all the text within documents, logs, survey results, and e-mails. Enterprise information actually represents a large percentage of the text information in the world today.
  • Social media data: This data is generated from the social media platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr. 
  • Mobile data: This includes data such as text messages and location information. 
  • Website content: This comes from any site delivering unstructured content, like YouTube, Flickr, or Instagram.

And the list goes on. Unstructured data is growing faster than the other types, and exploiting it could help with business decisions. A group called the Organization for the Advancement of Structured Information Standards (OASIS) has published the Unstructured Information Management Architecture (UIMA) standard. UIMA "defines platform-independent data representations and interfaces for software components or services called analytics, which analyze unstructured information and assign semantics to regions of that unstructured information." Many industry watchers say that Hadoop has become the de facto industry standard for managing Big Data.
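The idea quoted above from UIMA, assigning semantics to regions of unstructured information, can be sketched in a few lines of Python. This is not the UIMA API: the `Annotation` type, the `annotate` function and the two regular expressions are hypothetical, and the patterns are deliberately simplistic.

```python
import re
from typing import List, NamedTuple

class Annotation(NamedTuple):
    """A labeled region of the text (hypothetical type, not part of UIMA)."""
    start: int
    end: int
    label: str
    text: str

# Simplistic patterns, for illustration only.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def annotate(document: str) -> List[Annotation]:
    """Return labeled regions found in free text, ordered by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(document):
            found.append(Annotation(match.start(), match.end(), label, match.group()))
    return sorted(found)

message = "Contact bob@example.com before 2024-03-15 about the report."
annotations = annotate(message)
for a in annotations:
    print(a.label, a.start, a.end, a.text)
```

Once regions carry labels like this, they can be stored and queried like structured data, which is one way unstructured text ends up feeding business decisions.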

