
Tuesday, 11 April 2017

Big Data - Kapil Sharma (101)

Big Data is commonly described by five characteristics (the five V's):

1. Volume

2. Velocity 

3. Variety 

4. Veracity 

5. Value  

More than 90% of the data in existence today has been created in just the last few years.

To process such huge volumes of largely unstructured data, big data frameworks come into the picture.

Processing, computing on, and analysing this data is what extracts value from it and leads to unique value addition.




Wednesday, 25 November 2015

Basic Data Science Introduction (C-01) - Kapil Sharma

Vectors: 
The vector is a very important data structure in R programming. Vectors are the building blocks of matrices and data frames, and they can hold numeric, character, and logical values. The function c() is used to create vectors in R.

x <- c(2, 22, "xyz", -4)   # mixing numbers and a string coerces every element to character
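
A quick follow-up sketch, using the vector x created above, to show the coercion with base R functions:

class(x)          # "character": the numbers were coerced to strings
n <- c(2, 22, -4)
class(n)          # "numeric"
as.numeric(x)     # converts back where possible; "xyz" becomes NA with a warning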


Factors:

Factors look like vectors, but they represent categorical data: each distinct value becomes a level.
y <- c(1,2,3,4,5,6,7)
yf <- factor(y)   # each distinct value of y becomes a level
yf
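
To see what the factor actually stores, a short sketch using standard base R functions:

levels(yf)       # the distinct categories as characters: "1" "2" ... "7"
nlevels(yf)      # 7
table(yf)        # how many observations fall into each level
as.integer(yf)   # the underlying integer codes used internally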

Lists:

Lists are like vectors, but their elements can be of different types and lengths. They are created with list().
a <- list(dog = "pitbull", age = 100, color = "golden", weight = TRUE)
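
A brief sketch of how elements of the list a are accessed:

a$dog         # access by name: "pitbull"
a[["age"]]    # double brackets return the element itself: 100
a[1]          # single brackets return a sub-list holding the first element
str(a)        # compact overview of each element and its type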

Matrices:

Matrices are two-dimensional structures made of rows and columns (nrow, ncol), in which every element has the same type.
Rows and columns can be combined with rbind() and cbind(), as shown after the example below.

# Create matrix with 4 elements:

cells   <- c(3,5,16,29)        # the four matrix elements
colname <- c("Jun", "Feb")     # column labels
rowname <- c("Nut", "Orange")  # row labels
y <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rowname, colname))
y

        Jun Feb
Nut       3   5
Orange   16  29
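
Following up on the rbind()/cbind() mention above, a small sketch that extends the matrix y (the new row and column values are made up for illustration):

y2 <- rbind(y, Apple = c(7, 12))     # add a new row
y3 <- cbind(y2, Mar = c(4, 9, 11))   # add a new column
y3

        Jun Feb Mar
Nut       3   5   4
Orange   16  29   9
Apple     7  12  11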

Data frames:

A data frame looks like a matrix, but its columns can hold different types, such as numeric and character.
Location <- c("Mandi", "Manali")
Distance <- c(200, 307)
df <- data.frame(Location, Distance)
df

  Location Distance
1    Mandi      200
2   Manali      307
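
A short sketch of common base R ways to inspect the data frame df built above:

df$Distance               # the Distance column as a numeric vector
df[df$Distance > 250, ]   # rows where Distance is greater than 250
str(df)                   # structure: column names, types, and sample values
nrow(df); ncol(df)        # number of rows and columns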


Wednesday, 5 November 2014

What is Hadoop - Kapil Sharma

What is Hadoop?

Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.

Currently, three core components are included with the basic download from the Apache Software Foundation:


HDFS - the Java-based distributed file system that can store all kinds of data without prior organization.

MapReduce – a software programming model for processing large sets of data in parallel (a small conceptual sketch follows this list).

YARN – a resource management framework for scheduling and handling resource requests from distributed applications.
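
To make the MapReduce idea concrete, here is a tiny conceptual sketch in plain R. This is only the concept on a single machine; real Hadoop jobs run distributed across the cluster and are typically written in Java or run through Hadoop Streaming.

# Conceptual word count, MapReduce style, in plain R
lines <- c("big data is big", "hadoop stores big data")

# Map: turn every word into a (word, 1) pair
mapped <- lapply(strsplit(lines, " "),
                 function(words) setNames(rep(1, length(words)), words))

# Shuffle + Reduce: group the pairs by word and sum the counts
pairs   <- unlist(mapped)
reduced <- tapply(pairs, names(pairs), sum)
reduced
#    big   data hadoop     is stores
#      3      2      1      1      1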