Persistence
Persistence is the ability of data to live beyond the lifecycle of a process. Software generates a lot of data as part of its interactions, and that data needs to outlive the programs that produced it. In the beginning, data storage systems were file oriented. Files, however, did little to provide a good structure for the stored data. Their other main issue was proprietary formats, which made them impossible to read in the absence of specific programs. Processing data in files across versions of a software product was another challenge, as the runtime structures that hold the data change across releases. Files also scaled poorly when it came to searching and making localized updates to data. To solve these problems, the industry came up with RDBMS-style systems.
An RDBMS provides a very structured way of defining data. The rigorous structuring and the ability to apply constraints made it possible to preserve the integrity of data, popularly captured by the ACID properties. For the amount of data the world was dealing with at the time, RDBMS systems were sufficient and provided the required scalability, up to a point. This was the era when, to get indexed by a search engine, one had to submit the web link to the search engine. People achieved higher scalability with concepts like sharding, but at the cost of increased complexity. (Developers love to make things complex.)
Then Google came and changed the world. The amount of data they started dealing with got larger by any sane means of measurement. They inverted the concept of web indexing: rather than people submitting links, they started crawling the web, downloading pages locally, and indexing them. With this approach, however, came the problem of large data, now popularly termed "Big Data". A parallel development was happening at the same time. Doug Cutting was working on a search engine infrastructure project called Nutch, but he ran into issues scaling it further. Around that time Google published a paper on how it handles Big Data using GFS, the Google File System. Based on that paper, Doug started working on Hadoop along the lines of GFS and used the concept of MapReduce. Around the same time, Doug was employed by Yahoo, and the project got the required funding and resources. Today Hadoop is a clear winner for handling Big Data problems and is used in many enterprises that deal with Big Data. The core of Hadoop is its ability to push the processing logic to the data itself, without moving a lot of data around. This is different from technologies like grid computing, which move both the data and the processing logic, and it is the opposite of RDBMS systems, which pull the data to the processing logic. Having said that, it does not mean we are going to throw away RDBMS and adopt Hadoop everywhere. The two tend to complement each other, and each has its own strengths and weaknesses.
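To make the idea of pushing processing logic to the data concrete, here is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce Java API. The mapper runs on the nodes that already hold the input blocks and emits (word, 1) pairs, the reducer sums the counts per word, and only the compact intermediate pairs travel over the network. The class names and input/output paths are illustrative, not taken from any particular project.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs where the data block lives and emits (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives all counts for a given word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation cuts network traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory (e.g. on HDFS)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is packaged into a jar and launched with the hadoop jar command against HDFS paths; the framework ships the mapper code to the data nodes and schedules tasks close to the blocks, which is exactly the "move the code, not the data" principle described above.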
A high-level comparison of Hadoop and RDBMS is as follows:
Hadoop                                        | RDBMS
----------------------------------------------|--------------------------------------
Data integrity low                            | Data integrity high (ACID compliant)
Non-structured data                           | Structured data
Linear/horizontal scaling                     | Vertical/non-linear scaling
Good for an initial insert, then many reads   | Good for multiple updates
