Types of NoSQL Databases
NoSQL databases have seen a surge in popularity since their introduction. Developers looking for alternatives to the rigid architecture of relational databases can consider NoSQL for its adaptability and scalability.

Many software development companies use NoSQL databases to keep track of information like operational parameters, feature sets, and model metadata. They are also useful to data engineers for archiving and recovering data.

Before delving into NoSQL's significance, let's discuss the types of NoSQL databases and their features, benefits, and drawbacks, so that you can evaluate them against your project objectives and needs.

1. What is a NoSQL Database?

In lieu of rows and columns, NoSQL database systems typically store data as JSON-like documents. To clarify, NoSQL stands for "Not only SQL" and refers to any non-relational database. This implies that a NoSQL database may store and access data without using SQL, or may mix the flexibility of JSON with the capability of SQL. NoSQL databases are therefore designed to be adaptable, scalable, and adept at responding swiftly to the data management needs of modern enterprises. A traditional relational database uses SQL syntax to store, manage, and retrieve data, while NoSQL systems span a wide range of database technologies that can handle structured, semi-structured, and unstructured data with equal effect.

2. Types of NoSQL Databases

Following are descriptions of the four most common types of NoSQL databases:

  • Document databases
  • Key-value stores
  • Column-oriented databases
  • Graph databases

2.1 Document Databases

A document database holds information in a document format such as JSON, BSON, or XML (not Word documents or Google Docs). In a document database, documents can be nested, and particular elements can be indexed to facilitate faster querying.

Documents can be saved and accessed in a form far closer to the data objects used in application code, requiring less translation to use the information in an application. By contrast, SQL data frequently must be assembled and disassembled as it moves between applications and storage.

Document databases are popular with developers because their document formats can be reworked as needed to fit the application, and their data structures can be reshaped as requirements evolve over time. This flexibility accelerates development, since data is effectively treated as code and stays under the control of developers. Modifying the structure of a SQL database, by contrast, may require a database administrator to intervene.

The most widely used document databases have a scale-out design, which provides a clear route to scalability in terms of both data volume and traffic.

Industry-specific use cases include ecommerce systems, online trading, and mobile application development.

1. Key Features of Document Databases:

  • Flexible schema: the documents in the database follow an adaptable structure, which means different documents may have different schemas.
  • Low development and maintenance effort: once a document has been created, very little work is needed to keep it up to date.
  • No foreign keys: because documents are not bound to one another, they can exist independently; foreign keys are therefore unnecessary in a document database.
  • Open formats: documents are built with XML, JSON, and other open formats.

2. Advantages of Document Databases

  • Flexible, scalable data model with no need for foreign keys

3. Disadvantages of Document Databases

  • Searches are limited to primary keys and indexes, and complex queries require MapReduce.

Example:

JSON 
[
    {
        "year" : 2021,
        "title" : "Eternals",
        "info" : {
            "director" : "Chloé Zhao",
            "IMDB" : 6.3,
            "genres" : ["Science Fiction", "Action"]
        }
    },
    {
        "year": 2022,
        "title": "Doctor Strange in the Multiverse of Madness",
        "info": {
            "director" : "Sam Raimi",
            "IMDB" : 7.0,
            "genres" : ["Science Fiction", "Action", "Superhero"]
        }
    }
]
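To make the example concrete, here is a minimal sketch of querying these documents with the MongoDB Java driver. It assumes a local MongoDB instance and hypothetical database and collection names (moviedb, movies) holding the documents above:

import static com.mongodb.client.model.Filters.eq;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MovieQuery {
    public static void main(String[] args) {
        // Connect to a local MongoDB instance (hypothetical names throughout)
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> movies =
                client.getDatabase("moviedb").getCollection("movies");

            // Query on a nested field -- no schema migration or joins needed
            for (Document movie : movies.find(eq("info.director", "Sam Raimi"))) {
                System.out.println(movie.getString("title"));
            }
        }
    }
}

Because the schema is flexible, adding a new field to future documents requires no change to the existing ones.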

2.2 Key-Value Stores

A key-value store is the most elementary form of NoSQL database. Every piece of information in the database is represented as an attribute name (or "key") and its associated value. For example, the key might be "city" and the value "Bangalore". Ecommerce carts, user profiles, and user preferences are some examples of possible applications.

1. Key Features of the Key-Value Store

  • Scalability
  • Portability
  • Speed

2. Advantages of Key-Value Store

  • Values may be stored in a variety of formats such as JSON and XML; the schema is flexible, and the underlying data model is simple, scalable, and easy to understand.
  • Because of its simplicity it can process data quickly, and it works best when the underlying pieces of data are not closely connected.

3. Disadvantages of Key-Value Store

  • There are no relationships between records; you must manage your own foreign keys.
  • Lacks scanning abilities; not well suited for anything beyond CRUD (create, read, update, delete) or for complex data.

Example:

Key Value Store Database
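As a minimal sketch, here is what those key-value operations look like with the Jedis client for Redis, a popular key-value store (assuming a local Redis server on the default port; the keys shown are hypothetical):

import redis.clients.jedis.Jedis;

public class KeyValueDemo {
    public static void main(String[] args) {
        // Connect to a local Redis server (assumed at localhost:6379)
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Every element is stored as a key-value pair
            jedis.set("city", "Bangalore");
            jedis.set("cart:1001", "{\"items\":3,\"total\":1499}");

            // Lookups happen only by key -- there is no query language
            System.out.println(jedis.get("city"));      // Bangalore
            System.out.println(jedis.get("cart:1001")); // the serialized cart
        }
    }
}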

2.3 Column-Oriented Databases

A column store, in contrast to a relational database, is organized as a series of columns rather than rows. This lets you run analytics on a subset of columns without reading the rest of the data. Read speeds also improve because columns of the same type compress more effectively, and the values of a column are easy to aggregate. Analytics is a common application of column-oriented databases.

Whereas columnar databases excel at analytics, their difficulty in being strongly consistent is a major drawback, because updating all the columns of a record requires multiple writes to disk. Relational databases are immune to this issue because row data is written contiguously to disk. Column-oriented databases are widely used for data warehouses, CRM, business intelligence data, etc. Some examples of column-oriented databases are HBase, Cassandra, and Hypertable.

Further Reading on HBase vs Cassandra

1. Key Features of Columnar Oriented Database

  • Scalability
  • Flexibility
  • Fast response times

2. Advantages of Columnar Oriented Database

  • Scalability
  • Natural indexing
  • Support for semi-structured data
  • Access time

3. Disadvantages of Columnar Oriented Database

  • Cannot be used with relational data

Example:

Suppose a database has a table like this:

RowId | StudentName | Maths Marks | Science Marks
001   | John        | 98          | 85
002   | Smith       | 85          | 99
003   | Adam        | 75          | 85
Column-Oriented Database
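Here is a plain-Java sketch (not tied to any particular database product) of how a column store would lay out this table, and why aggregating a single column is cheap:

import java.util.Arrays;

public class ColumnStoreDemo {
    public static void main(String[] args) {
        // A row store keeps whole records together:
        //   [001, John, 98, 85], [002, Smith, 85, 99], [003, Adam, 75, 85]
        // A column store keeps each column together instead:
        String[] studentName  = {"John", "Smith", "Adam"};
        int[]    mathsMarks   = {98, 85, 75};
        int[]    scienceMarks = {85, 99, 85};

        // Aggregating one column touches only that column's data;
        // the name and science columns never need to be read
        double avgMaths = Arrays.stream(mathsMarks).average().orElse(0);
        System.out.println("Average maths marks: " + avgMaths);
    }
}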

2.4 Graph Databases

A graph database is designed to highlight the connections between data points. Each piece of information is represented as a "node", and the interconnections between nodes are called links or relationships. Connections are recorded directly as first-class items in a graph database. In relational databases, connections are expressed through the data itself, so relationships are implied rather than explicitly recorded.

Because joining many tables in SQL is inefficient, a graph database is better suited to storing and retrieving the relationships between data items.

In practice, only a small handful of enterprise-level systems can function well using graph queries alone. Consequently, graph databases typically coexist with other, more conventional types of NoSQL databases. Cybercrime detection, social media, and knowledge graphs are some of their applications.

Even though they share the NoSQL name, these database types differ considerably in their underlying data structures and potential uses.

1. Key Features of Graph Database

  • One of the main features of a graph database is that it makes it straightforward to see how various pieces of information are connected through the links between them.
  • Query output reflects current, up-to-the-moment information.
  • Query speed depends on the number and complexity of the relationships between the various parts of the database.

2. Advantages of Graph Database

  • Super-effective
  • Locally-indexed connected data 
  • ACID support
  • Instantaneous output
  • Flexible architecture

3. Disadvantages of Graph Database

  • Scaling out (horizontally) is challenging, though scaling up (vertically) is possible.

Example:

Employee Table:

Emp_ID | Employee Name | Age | Contact Number
001    | John          | 25  | 9475858574
002    | Smith         | 26  | 7485961231
003    | Adam          | 24  | 7412589634
004    | Johnson       | 22  | 9874563521

Employee Connections Table:

Emp_ID | Connection_ID
001    | 002
001    | 003
001    | 004
002    | 001
002    | 003
003    | 001
003    | 002
003    | 004
004    | 001
004    | 003
Graph Database
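As a minimal plain-Java sketch, the connections table above can be pictured as an adjacency list, where traversing a relationship is a direct lookup rather than a table join:

import java.util.List;
import java.util.Map;

public class GraphDemo {
    public static void main(String[] args) {
        // Each employee is a node; each entry in the list is an edge
        Map<String, List<String>> connections = Map.of(
            "John",    List.of("Smith", "Adam", "Johnson"),
            "Smith",   List.of("John", "Adam"),
            "Adam",    List.of("John", "Smith", "Johnson"),
            "Johnson", List.of("John", "Adam")
        );

        // One-hop traversal: who is John connected to?
        for (String contact : connections.get("John")) {
            System.out.println("John -> " + contact);
        }
    }
}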

3. When to Use Which Type of NoSQL Database?

If you need to store and represent a wide variety of data types (structured, semi-structured, and unstructured) in a single database, you should look into a NoSQL database. NoSQL databases are also more adaptable, since the data we keep in them does not require a pre-established structure, as it does in SQL databases. Choosing the right NoSQL database for a given application can be challenging because each kind has its own unique characteristics, so it is important to get a sense of typical applications before making a choice.

Please get in touch with our technical team if you need any assistance with this.

Hbase vs Cassandra: Key Comparison

1. Introduction to Both Databases

Traditional databases, whether SQL or NoSQL, have all updated their conventional approaches to data storage. As a business, you will see how data storage capabilities have evolved with time. Storage is no longer only tabular; there are a plethora of ways to run and manage your databases.

Apache Cassandra and Apache HBase are two popular database models that can be used to store, manage, and extract information, making the best use of data. If we are comparing HBase vs Cassandra, there is something they have in common. Not just something: many things. They look identical and possess similar characteristics and functions. However, if you look deeply, you'll find major differences in the way they work. That's what we'll discover here. Before going anywhere else, let's see what professionals on Quora recommend.

Hbase vs Cassandra Quora

On Stack Overflow, meanwhile, users discuss HBase and Cassandra in a more technical way.

Stackoverflow HBase vs Cassandra

Some databases serve big data applications, while others use flexible schemas, wide columns, graphs, or document stores; all of these are now widely used in big data and real-time web applications. In this blog post, we aim to bring out the differences between the HBase and Cassandra databases, and we will discuss a detailed comparison between them in terms of architecture, support, documentation, query language, and several other details. The motive of this post is to give enough information and insight about these two databases that software development companies and business owners can easily select between them. So, without much ado, let's get started.

2. What are Hbase and Cassandra?

To start with, HBase has its own renowned way of managing data. It is popularly used to provide random access to large amounts of structured data. It is column-oriented, built on top of the Hadoop Distributed File System (HDFS), works in real time, and stores its data in HDFS. HBase is an open-source distributed database that simplifies the handling of data replication. Its other essential components include the HMaster, Region Servers, and ZooKeeper. According to GitHub data, HBase has 4.6K stars and 3.1K forks.

Github HBase

Let’s take a quick overview of Cassandra as well.

Cassandra is designed to handle large amounts of data across multiple commodity servers while ensuring high availability with no single point of failure. It has a distributed architecture in which data is spread across multiple machines with a configurable replication factor. According to GitHub data, Cassandra has 7.5K stars and 3.2K forks.

GitHub Cassandra

These were just some introductory aspects; we will now discuss the actual differences between HBase and Cassandra.

2.1 Architecture

Among the many database management systems, HBase comes with a master-based architecture, whereas Cassandra is masterless. This means HBase has a single point of failure, whereas Cassandra does not. The Apache HBase client communicates directly with the slave servers without contacting the master, which keeps the cluster working while the master is unavailable.

The HBase model is based on a master-slave architecture, while Cassandra is based on an active-active node architecture. Furthermore, in the Cassandra vs. HBase comparison, the former supports both the data storage and the data management parts of the architecture, whereas the latter is designed only for data management, relying on other systems and technologies for storage, server status management, caching, redundant nodes, and metadata.

2.2 Data Models

The data models that HBase and Cassandra work on are slightly different. While they may sound more or less the same for both databases, there are some primary differences between the two.

HBase works with column families: each column is identified by a column qualifier, and rows are addressed by row keys. Cassandra also has columns, comparable to HBase cells, and is likewise a column-oriented database.

One of Cassandra's key characteristics is that it allows a primary key to span multiple columns, whereas HBase has only single-column row keys and puts the responsibility for row key design on developers. A Cassandra primary key consists of a partition key plus clustering columns, and the partition key itself may contain several columns, as sketched below.
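Here is a hedged sketch of such a composite primary key using the DataStax Java driver (4.x); the keyspace, table, and column names are hypothetical, and the keyspace demo is assumed to already exist:

import com.datastax.oss.driver.api.core.CqlSession;
import java.net.InetSocketAddress;

public class CassandraKeyDemo {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .build()) {
            // (country, city) is a multi-column partition key deciding which
            // node stores the row; employee_id is a clustering column that
            // sorts rows within each partition
            session.execute(
                "CREATE TABLE IF NOT EXISTS demo.employees ("
                + "  country text, city text, employee_id int, name text,"
                + "  PRIMARY KEY ((country, city), employee_id))");
        }
    }
}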

2.3 Performance – Read and Write Operation

When it comes to performance in a comparison of Apache Cassandra and Apache HBase, we must weigh several factors, taking the read and write capabilities of both models into account. According to research conducted by Cloudera, here's what they found.

Write:

The on-server write paths of HBase and Cassandra are nearly identical, apart from the different names of the data structures involved. Cassandra has some advantages over HBase, such as having multiple servers able to accept a given write, and the fact that HBase does not write to its log and cache at the same time.

Read:

When it comes to reads, Cassandra is extremely fast and consistent as well, while HBase has a way to go and is comparatively slow. HBase is slow because it reads from only one server, and there is no facility for comparing the data versions held by the various nodes. Even though Cassandra can handle a good number of reads per second, its reads are targeted and have a high probability of being rejected.

In comparison to read and write operations, Cassandra has a winning hand.

2.4 Infrastructure

If we are talking about infrastructure, then we are speaking of all the tools that play a pivotal role in keeping it running. HBase uses the Hadoop infrastructure, which includes all its moving parts, such as the HBase Master, ZooKeeper, and the Name and Data nodes.

Cassandra, on the other hand, brings its own variety of operations and infrastructure and can employ various DBMSs alongside it; many Cassandra applications also make use of Storm or Hadoop. Furthermore, its infrastructure is built on a single node type, so every node plays the same role.

2.5 Security

Security of the data is an important aspect for HBase as well as Cassandra, since NoSQL databases in general involve security trade-offs. One of the main concerns for businesses is securing data while keeping performance at par, so that the system doesn't become heavy and inflexible.

However, it is safe to say that both databases offer features to ensure data security: authentication and authorization in both, plus inter-node and client-to-node encryption in Cassandra. HBase, in turn, provides the needed secure communication through the other technologies on which it is built.

2.6 Support

HBase offers access control down to the cell level. It focuses on letting administrators manage visibility labels on sets of data and informing user groups which labels they can access, down to the row level. Cassandra applies access labels at the row level and assigns permissions and conditions there.

2.7 Documentation

Documentation is an important part of working with any database, and writing it is not easy. Cassandra is harder to learn because its documentation is not up to the mark, while HBase is quite easy to learn thanks to its better documentation.

2.8 Query Language

The HBase shell is JRuby-based. Cassandra, by contrast, has its own dedicated query language: CQL, modeled along the same lines as SQL. Compared to the HBase shell, CQL offers more features and is far richer in functionality.

3. Similarities Between the Two

Now that we have seen the differences between the two distributed databases, it is equally important to see what makes these two models alike. Yes, the HBase vs Cassandra comparison above was drawn to show how they differ; in the next section, we will see what makes them similar.

3.1 Database Similarity

HBase and Cassandra are both open-source NoSQL databases. Both these technologies can easily handle large data sets as well as non-relational data such as images, audio, and videos.

3.2 Flexibility

HBase and Cassandra both scale linearly. Users who want to handle more data can do so by increasing the number of nodes in the cluster. Since both scale this way, either can be used in a range of scenarios; the result will be the same, without efficiency concerns.

3.3 Duplication

Both HBase and Cassandra have robust protection against data loss even after a system failure. This is accomplished through replication: data written on one node is replicated across multiple nodes in the cluster.

3.4 Coding

Both databases are column-oriented, with columns acting as the primary storage unit, and users can freely add columns as per their needs. They also have similar write paths: a write operation is first logged to a log file, primarily to ensure durability.

4. Differentiating HBase vs Cassandra Table

Comparing Factors | HBase | Cassandra
Database foundation | HBase is modeled on Google Bigtable. | Cassandra draws on Amazon's Dynamo design.
Model of architecture | Master-slave architecture model. | Active-active node architecture model.
Co-processor | Coprocessor capability is available in HBase. | There is no coprocessor facility in Cassandra.
Architecture style | HBase follows the Hadoop infrastructure. | Cassandra employs its own infrastructure and a multitude of DBMSs for different applications.
Cluster ecosystem | An HBase cluster ecosystem is not easy to set up. | Cassandra cluster setup is simpler than HBase.
Transactions | Handled via two mechanisms: 'Check and Put' and 'Read-Check-Delete' (see the sketch below the table). | Handled via two mechanisms: 'Compare and Set' and row-level write isolation.
Read and write operations | HBase excels at intensive reads. | Cassandra excels at writes.
Popular brands using it | Adobe, Yahoo, Walmart | Netflix, eBay
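The Transactions row above mentions HBase's 'Check and Put'. As a hedged sketch using the HBase 1.x client API (the table, column family, and values here are hypothetical), it looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("accounts"))) {
            // Atomically apply the update only if the cell still holds
            // the value we read earlier (an optimistic check-and-put)
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("balance"),
                          Bytes.toBytes("150"));
            boolean applied = table.checkAndPut(
                Bytes.toBytes("row-42"), Bytes.toBytes("cf"),
                Bytes.toBytes("balance"), Bytes.toBytes("100"), put);
            System.out.println("Write applied: " + applied);
        }
    }
}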

5. Which One is the Best of the Two?

Can you choose between your two hands that look exactly the same? Well, these two are definitely not twins. HBase and Cassandra, both non-relational databases, look identical yet differ from each other. Though there are similar areas, many differences make each of them unique in its own way, and both have their pros and cons. We know that Cassandra excels at writing, while HBase excels at intensive reading. If there is something Cassandra is weak at, it is data consistency, while HBase's weaker side is availability. Both attempt to eliminate the negative consequences of these trade-offs and build on their strengths.

What is Power BI?
Every piece of information you access is data, and the reverse is equally true. The business world is generating data in leaps and bounds, and there is no limit to it. In a data-driven world, every action you perform creates a record, and modern businesses are overwhelmed with data. So what is the one way to manage this humongous amount of data? If you are thinking of getting actionable insights by evaluating datasets, then you might need a tool for it. Ever heard of Power BI?

Very few people haven't been introduced to this powerful tool. Power BI is a SaaS-based business intelligence platform from Microsoft. Since the introduction of Power BI in 2014, Microsoft has provided regular updates and made it steadily more useful for its customers. Power BI converts your raw and scattered data into interactive visualizations in one place, helping users make business decisions. The platform provides hundreds of certified connectors for bringing in data from different systems and even from files.

In this post, our SharePoint developers explore some of the most important aspects of Microsoft Power BI to know before you start using it. Let's dig deeper into what Power BI is.

1. Introduction to Power BI

Businesses often assume Power BI is just a business intelligence tool. That's only partly correct, because it is much more than that. Microsoft Power BI is a data visualization tool that converts data from varied sources into visually interesting and interactive dashboards.

To define it: Power BI is a business intelligence platform that cloud-based apps and other organizations can use to collate, manage, and analyze data from different sources in a convenient way. A business intelligence platform fed with rich information helps businesses make the right decision at the right time.

Operating Power BI is very simple. It extracts data from multiple sources, puts it together, and intelligently converts it into visually compelling output that can be used to make informed, data-driven decisions. The information is consumed in Power BI reports in the form of graphs, charts, snapshots, and other visual formats.

If we were to list the sources Power BI can draw from, they would include countless options: Excel spreadsheets, Word documents, databases, and other information available in on-premise data sources or in the cloud.

So is Power BI a desktop tool, a cloud service, or a mobile app? The answer is: all of these. It is a Windows desktop application called Power BI Desktop; an online SaaS-based app that can be conveniently accessed from any location, called the Power BI Service; and a mobile app for Windows, iOS, and Android phones and tablets. These different versions of Power BI are used on different platforms.

Let’s see how each one of these Power BI tools works distinctly.

2. Power BI Desktop

Microsoft Power BI Desktop is a free data visualization tool that you install on Windows computers to create reports and visualizations for yourself. It supports different databases and systems to extract business data. It is an extremely handy tool for data scientists and developers to transform your information into meaningful visualizations. Power BI Desktop also offers data warehouse capabilities along with data discovery and preparation. You can prepare reports using this tool but sharing of information is not possible within the tool.

Power BI Users can:

  1. Connect, transform, and model data.
  2. Combine multiple sources into one dataset and generate reports.
  3. Add visuals from the vast variety of provided visualizations; additionally, you can install custom visualizations from the marketplace or develop your own Power BI visuals.
  4. Choose a color theme from existing templates, or create a new one with your own set of colors.
  5. Set up rules for row-level security.
  6. Use the Power Query editor and DAX queries to shape your data.
  7. Check the performance of visualizations.
  8. Use Python support.
  9. Publish reports to the Power BI service.

3. Power BI Service

The Power BI Service is a cloud platform that allows users to create interactive dashboards and share the reports they develop. It supports light report editing and focuses primarily on collaboration across teams and organizations. Most features of Power BI Desktop are supported in the Power BI Service too. The service lets users share their own reports with other users, who can access those reports and dashboards through the Power BI website or the Power BI mobile apps for Windows, Android, and Apple devices.

The Power BI Service has the following features to explore:

  • Share dashboards and reports with other Power BI users.
  • Subscribe to and set alerts on reports and Power BI dashboards.
  • Set up permissions for who can view and edit your reports.
  • Manage row-level security.
  • Create different workspaces.
  • Create Power BI apps to securely share business intelligence dashboards and reports with other users in teams.
  • Analyze data in Excel.
  • Securely share reports into other systems via the embed API.
  • Schedule data refreshes and set up on-premise gateways for legacy data.

A few things cannot be achieved using the Power BI Service alone, and for these you need Power BI Desktop:

  • Add/edit rules for row-level security.
  • Create calculated columns.
  • Advanced query editor.
  • Python and DAX support.
  • Data transformation and modeling.

4. Power BI Report Server

Power BI Report Server is an on-premises report server with a web portal that enables users to display and manage reports and KPIs. It supports paginated reports, Power BI reports, KPIs, and mobile reports. This lets the end users of a business access reports in various ways: receiving them by email in their inbox, or viewing them on a mobile device or in a web browser.

Power BI Report Server offers some extensive features like –

  • Invert and continuous axis sorting.
  • CALCULATE filter
  • Smart guides for object alignment.
  • CROSSFILTER to support different relationships
  • Visual Zoom Slider
  • ArcGIS to support Power BI

5. Power BI Mobile

Power BI Mobile lets users stay connected to their business data from anywhere at any time, putting mobile BI just a touch away. It enables business owners and employees to monitor the business from their phones and access data stored in the cloud or in on-premises SQL Server instances. Power BI Mobile applications offer a 360-degree view of all this information.

Power BI Mobile offers some extensive features like –

  • Flexible and secure mobile access
  • Push notifications for data alerts
  • Easily annotate reports with a single touch
  • Reports for mobile users with a live dashboard 

6. Power BI embedded

Power BI Embedded analytics enables users to embed Power BI content like dashboards, reports, and tiles in a website or web application. It helps you offer compelling data experiences to the end users of your business and lets them act on the insights they get from the data. It also provides strong customer-facing reports, analytics, and dashboards inside the business app. Besides this, Power BI Embedded helps reduce developer effort, as it automates everything from app monitoring to deployment of analytics.

Power BI embedded offers some extensive features like – 

  • It offers hourly services without any usage limit.
  • It is a cost-effective solution for businesses that want to have powerful business intelligence.
  • Helps in merging its capabilities with Power BI Viewer.
  • Facilitates implementation of data governance aspects.

7. Power BI Benefits

Power BI is Secure

Power BI comes with various security features that help business owners to protect important and sensitive data. It also enables businesses to meet security and compliance standards. For instance, with the use of Microsoft’s Cloud App Security feature, Power BI offers sophisticated analytics to combat cyber threats. Besides this, sensitivity labels of Power BI make it very easy for admins to alert customers about what data is sensitive. 

Power BI Offers Business Intelligence for All

Power BI is a platform that empowers different types of organizations to create data-driven cultures. This means that the business decisions are made as per the data or information companies have. It results in organizations accomplishing difficult tasks with the help of business intelligence assets. Basically, Power BI enables companies to create an effect where all the employees of the firm can make decisions according to trustworthy and real-time data.

Power BI Easily Connects With Data Sources

Power BI enables the connection of myriad data sources and this includes everything from file data sources such as CSV and Excel to database sources like Snowflake and Oracle database to online data sources such as Adobe Analytics and Salesforce.

Power BI is Improving Everyday

With each passing day, Microsoft is pouring money and time to improve Power BI and this shows its dedication to making it the best data analytics platform in the world. Every now and then new features are added to this tool and the existing ones are improved and tweaked. 

Power BI has Artificial Intelligence Capabilities

Power BI comes with artificial intelligence capabilities that let users extract valuable information, data, and reporting. It also provides three powerful AI visualizations that are useful to software developers when they need to dive deep into important data and generate insights.

8. Power BI Features

Datasets Filtration

A dataset is a set of data gathered from one or more data sources; developers use datasets to create different kinds of visualizations. A dataset can be created from a single source, such as an Excel workbook. You can then filter a dataset into smaller subsets that hold only the important information. Power BI offers a wide range of built-in connectors, such as Oracle, Facebook, Excel, Salesforce, SQL Database, and more, that users can easily use to create datasets.

Flexible Tiles

A tile is a single block holding one visualization on a Power BI dashboard. Tiles are generally used to separate the individual visualizations, which gives a clearer view of the data. Tiles are adjustable and can be placed anywhere on the Power BI dashboard according to the user's convenience.

Informative Reports

In Power BI, reports are a combination of different types of visualization on dashboards that are relevant to specific business topics. A report displays a structured presentation of the business data and also reveals insights from it. This helps the users to easily understand the graph of the business and it can also be shared with other users or employees of the firm.

9. Components of Power BI

Power Query

Power Query is a component for data transformation. It enables developers to find, connect, and combine data sources to meet the required needs. Business analysts use it to transform, integrate, and enrich big data for the Power BI web service.

Power View

Power View is available in SharePoint, Excel, SQL Server, and Power BI. This technology helps in creating interactive graphs, charts, maps, and more.

Power Pivot

Power Pivot is a component that follows a data modeling technique to help users develop data models. It uses Data Analysis Expression (DAX) language for modeling both simple and complex data.

Power BI Desktop

Power BI Desktop is a tool for Power Pivot, Power Query, and Power View. It enables us to have all information under one system.

Power Map

Power Map is used for Power BI and Excel. It is a 3-D data visualization tool that allows the users to map the business data and plot millions of rows to visualize data on Bing Maps.

10. Power BI Connectivity Types

Power BI provides majorly four types of connectivity depending on the data sources connector you are using.

  1. Import
  2. DirectQuery
  3. LiveConnection
  4. Composite (Mixed-mode)

These different connectivity types have their benefits and limitations. You should check all of them to choose the most suitable option.

Import

Import is the most common connectivity type in Power BI, and almost all data sources support it. All imported data is stored in a PBIX file, so it is important for businesses to understand what data is to be imported. Storing the data within the file and in memory speeds up retrieving, querying, and loading reports, making this the fastest connectivity type of all. If the tables are very large, however, this type is not suitable, and since performance depends on memory and the machine's processor, you will feel slowness during development when using the Power BI Desktop tool on a local computer with a large amount of data.

The Import connectivity type lets you use the full capabilities of the Power BI Desktop tool, with full DAX support, and it stores all data in memory. Power BI also lets you use the full dataset capacity permitted by your Microsoft Power BI license.

When you are using Import type, you will have Reporting, Data, and Modelling – 3 option tabs displayed in the Power BI Desktop tool.

Import

DirectQuery

Next in line is DirectQuery. As the name suggests, it fires queries directly against the data source. Unlike the Import connectivity type, no actual data is stored in the Power BI report file. DirectQuery is available only with relational database sources. Power BI stores only the metadata of the source (table names, field names, relationships, etc.) in the file, not the data itself. This is a major benefit when working with large data tables, since it avoids the 1 GB dataset size restriction.

When you interact with Power BI reports, data is requested from the sources as per the applied filters, using the details stored in the file. When the data is updated often, you therefore get nearly real-time data with this connection type. Performance may decrease, since queries are fired in real time and no data is available in the file, but several techniques are provided to minimize the queries sent to the source.

Since the data is not stored in the Power BI file, DirectQuery has certain limitations on the capabilities of Power BI Desktop. Until recently, many DAX operations were not supported, and some transformations in Power Query are restricted, such as changing column data types, splitting columns, and removing duplicates. Also, DirectQuery limits returned data to 1 million rows unless you have a Power BI Premium license.

When you are using the DirectQuery type, you will have Reporting and Modelling – 2 option tabs displayed in the Power BI Desktop tool.

LiveConnection

With the LiveConnection connectivity type, no data is stored in the Power BI file either: all data is queried live from an existing Analysis Services model when interacting with reports. Only Azure Analysis Services, SQL Server Analysis Services, and Power BI datasets in the Power BI Service can be used with this connection type. Since these sources are analytical services, query performance is much better than DirectQuery. Live connections are generally used in enterprise deployments of Power BI.

Because the data is not stored in the Power BI file in either case, many people confuse DirectQuery with LiveConnection, but the two are quite different and cannot be used interchangeably. Since Analysis Services owns the data, you do not get much freedom in data transformation and authoring: only report-level DAX measures can be added. These measures live in the Power BI file and cannot change the data in the Analysis Services model.

You have all reporting capabilities when you use LiveConnection, so only the Reports option tab is displayed in the Power BI Desktop tool.

Composite (Mixed-mode)

In the past, developers couldn't connect to multiple sources using Import and DirectQuery in a single Power BI report. Now it is possible using the Composite connection type: you can include one table from SQL Server via DirectQuery and another table via Import in the same report. This way you can keep a small amount of data in the Power BI file while connecting to large tables with DirectQuery.

You have seen the variations in connection types to connect data sources in Power BI. This variation gives you the power to accommodate different types and sizes of data sources at your ease.

Let’s look at Power BI licensing and cost; luckily, there are only two variations:

11. Licensing Information

The Power BI service has 2 licenses: the Power BI Pro license, to get started with, and the Power BI Premium license, for advanced data analytics, big data support, and dedicated cloud compute.

Read More about Power BI + Google Analytics = Power Analytics

Have a look at the below comparisons before choosing licenses for your use:

Features | Power BI Pro | Power BI Premium
Pricing | $10 per user per month | $4,995 per instance per month
Included with Office 365 Enterprise E5 | Yes | No
Dedicated cloud compute and storage resources | No | Yes
On-premise reporting through Power BI Report Server | No | Yes
Compute processing environment | Shared | Dedicated
Content deployment in multiple regions | No | Yes
Incremental data refresh | Yes | Yes
Data refreshes per day | 8 | 48
Allocate compute resources | No | Yes
Monitor performance of compute resources | No | Yes
Maximum size of an individual dataset | 1 GB | 10 GB
Maximum storage | 10 GB per user | 100 GB
Data security encryption | Yes | Yes

You may notice the supported dataset sizes in the table above and feel that Power BI is not useful when you have large data to serve. It may surprise you that Microsoft uses a data compression technique, the VertiPaq storage engine, to minimize the size of data after import.

The VertiPaq storage engine compresses the data, reducing its size by up to 10x. There is no specific equation that defines the ratio; it depends on how the architecture of your raw data is managed. Typically, we see around 10 GB of source data compress to about 1 GB in Power BI reports.

Even with the VertiPaq engine compressing the data, it is important to load only the required data into Power BI to minimize the dataset size, because gradual data refreshes will grow the report over time. Ultimately, this directly or indirectly affects the performance of reports and visualizations.

12. Data Reduction Techniques

Microsoft suggests eight different techniques to reduce the data size:

  • Remove unnecessary columns
    Microsoft recommends that you include only those columns in the model that are mandatory for reports. Your requirements may change over time, but be aware that adding new columns to the model later through data modeling is easy.
  • Remove unnecessary rows
    Microsoft recommends that you include in the model only those rows that are mandatory for reports. With careful observation, you can set filters that allow only the required rows into reports, which reduces the report size and also improves report performance. For example, if you only need the current year's sales data for your report, then instead of including all years' sales data, filter and include only this year's sales rows.
  • Group by and summarize
    These charts and visualizations need summarized data, and SharePoint developers normally pull all data first in Microsoft Power BI and then shape them accordingly. Instead, you can pre-summarize data in Power BI that will reduce its size, and after importing to Power BI again, with a reduction in the dataset. You can shrink the dataset size which is one of the effective ways to eliminate multiple rows and columns.
  • Optimize Column data types
    VertiPaq storage engine has different methods for each column for compression. There is compression seen in numeric columns with high margins and perhaps decreases the size of the dataset. So, it’s advisable to check and set proper data types for columns in the table after importing the data. For example, if you have one column “Lead Number” with alpha-numeric values, like “L00001, L00002, L00003…”. Power BI will detect this as a text column due to alpha-numeric values. Since the prefix of the numbers is already fixed in this column, you can remove the prefix and the column converts into number type. For the large tables, this minor change will give a huge effect on data compression and thus data reduction.
  • Preference for custom columns
    Power BI provides facilities to create custom columns in tables. The VertiPaq storage engine stores custom columns just like Power Query sourced columns. These columns are less effective during data reduction and take more time during data refreshes each time. Also, it’s advisable to add custom columns via Power Query editor instead of using direct Dax queries on models because Dax query custom columns are built once all Power query tables are refreshed and it increases the refresh time. However, if you create these calculated columns in the SQL server or any other systems before importing into Power BI will reduce the calculation efforts inside the engine and increase the performance.
  • Disable Power Query Load
    By default, Power Query is enabled in Power BI that fires a report to integrate data between different systems. But this has to be performed within the same report. To avoid the loading of the query, you can disable it in the query editor as shown in the below image

  • Power Query Load
  • Disable auto date/time
    Power BI Desktop has one option called “Auto Date/Time” which creates new data columns for storing data. These fragments of instances are displayed in the year, month, and day format for better filter options available. But when we use this for large tables in reports, these new columns increase the size of the dataset. You can disable this option for date-time columns on which you don’t need such filters.
  • Switch to Mixed mode
    If you want to determine the storage of each table, the Power BI desktop application can help. It enables you to easily bifurcate and create space in the storage. It suggests using composite connectivity – mixed mode to get the data. Either you can import data in the Power BI dataset, or you can use direct queries in the source system. This will help you arrange data as per your need. The direct query option is very useful when there are large size tables and data needs constant updates. This mixed mode will be helpful for summarized data. It will help you tap on the direct query option and other data can reside on the dataset.

13. Final words

Our modern business world is continuously drowning in data, and every move we make builds a new data record. Power BI is a business intelligence tool that helps users understand how their data works and makes business processes more productive. It also shows how eliminating unnecessary storage using Power BI can make tasks more efficient. We hope this insightful blog gives a true and practical sense of what Power BI is and how to use it to manage your data strategically.

More Related Blog Post
BI Tools – Microsoft Power BI vs. Google Data Studio

Data Analytics with Elasticsearch, Logstash and Kibana
The ELK stack, which scales nicely and whose parts work together seamlessly, is a combination of three open-source projects:

  • Elasticsearch: launched in 2012 as a commercially supported open-source project, built on top of Lucene; it uses JSON and has a rich API
  • Logstash: around since 2009, as a method to stash logs
  • Kibana: around since 2011, used to visualize event data

ELK is mostly used for log analysis and end-to-end big data analytics. This is a mini tutorial on setting up the ELK stack so that you can implement a solution on top of it.

ELK Stack Installation Steps

  1. Go to the official website https://www.elastic.co/downloads and download the three products into a separate directory.
  2. Extract all three downloads. In this tutorial we are using Windows 10 as the host OS.
  3. To start Elasticsearch
    • Go to the <<Elasticsearch>>/bin and run elasticsearch.bat as an administrator.
    • After starting Elasticsearch server check http://localhost:9200 in browser to confirm the startup.
  4. To start Kibana
    • Go to the <<Kibana>>/bin and run kibana.bat as an administrator.
    • After Kibana server is started check http://localhost:5601 in web browser.
  5. To start Logstash
    • Go to the bin directory of Logstash and open command prompt as an administrator
      logstash -e 'input { stdin { } } output { stdout {} }'
    • When the main pipeline starts (“Pipeline main started”), type any message in the command prompt.
    • If everything is working seamlessly, Logstash will return your message with appended timestamp and IP.

Architectural Description of ELK Stack

Architectural Description of ELK Stack

As we can see in the architecture above, Logstash collects raw data from various sources like HDFS, logs (system logs, HTTP logs, proxy logs, etc.), Twitter streams, and MySQL, and sends it on for further processing. Let's look at each component of this ELK stack.

1. Elasticsearch

Elasticsearch is a highly scalable real-time distributed search engine, which is mostly used for analysing and indexing the data.

  • It uses Lucene engine for fast searching and indexing.
  • It uses full text based searching.
  • Elasticsearch is an unstructured database which stores data in documents.
  • Elasticsearch runs in cluster mode and data is distributed across the nodes.
  • Comparison between a relational database and Elasticsearch:

    Elasticsearch | RDBMS
    Index | Database
    Shard | Shard
    Mapping | Table
    Field | Field
    JSON Object | Tuple

  • An "index" in Elasticsearch is a collection of different types of documents and document properties. When data is pushed to Elasticsearch, it is arranged into Lucene indexes, which Elasticsearch then uses for read/write operations.
  • To create an index, issue a PUT request to http://localhost:9200/index_name

You can search your data with http://localhost:9200/index_name/_search as shown in the screenshot below.
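The same PUT and search requests can also be issued from code. Here is a minimal sketch using the Elasticsearch low-level Java REST client (the index name and query are placeholders):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class EsIndexDemo {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // PUT http://localhost:9200/index_name -- creates the index
            Response create = client.performRequest(new Request("PUT", "/index_name"));
            System.out.println(create.getStatusLine());

            // GET http://localhost:9200/index_name/_search -- queries it
            Request search = new Request("GET", "/index_name/_search");
            search.setJsonEntity("{\"query\": {\"match_all\": {}}}");
            System.out.println(client.performRequest(search).getStatusLine());
        }
    }
}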

2. Logstash

As shown in the above architectural diagram

  • Logstash collects logs and events from various sources like HDFS, MySQL, logs (system logs, application logs, network logs), Twitter, etc.
  • It transforms the data and sends it to the Elasticsearch database.
  • Logstash uses a number of input, filter, and output plugins, and transforms the raw data based on the filters specified in its configuration file.
  • Here is an example of a Logstash configuration file:
  • The file above specifies the input location, the output location, and the filter to be applied to the data.

The following command starts Logstash with a configuration file: logstash -f <config_file>. As shown above, Logstash opens the pipeline between Logstash and Elasticsearch and starts parsing data into Elasticsearch. To visualize the data, we use Kibana, the visualization tool.

3. Kibana

Kibana is an open-source visualization tool which provides a beautiful web interface for visualizing Elasticsearch data.

  • Kibana allows us to create real-time dashboards in a browser-based interface.
  • Kibana offers different visualizations like bar charts, graphs, pie charts, maps, tables, etc.
  • It allows you to save, edit, delete, and share dashboards.
  • After starting kibana.bat, open http://localhost:5601 in a browser and go to the Management view, as in the screenshot below.
  • From there, select your index name and move ahead to work on that index.
  • The Discover option will allow you to see the data, as shown in the screenshot below.
  • The Dashboard option will allow you to create your own dashboard, which can hold multiple visuals, as in the screenshot below.

Kibana's "DevTools" option helps you interact with Elasticsearch data directly. For example, to search the records of an index, you can run a query as shown below.

4. Elasticsearch-Hadoop

Elasticsearch for Apache Hadoop (ES-Hadoop) is a connector that lets Hadoop jobs such as MapReduce, Hive, and Spark read from and write to Elasticsearch, tying the stack into the broader big data ecosystem shown in the architecture above.

Elasticsearch-Hadoop

Use Cases or Examples of ELK Implementations

  1. DELL – Powering the search to put the customer first
  2. Facebook – Delivering a better help experience for over a billion users
  3. Microsoft – Providing search on Azure and powering Social Dynamics
  4. IBM – Providing the operational log analysis engine for Bluemix apps
  5. Salesforce – Empowering businesses with log analysis for usage trends
  6. Accenture – Powering the search for the best client service
  7. Sprint – Analyzing 200 dashboards to search for better retail operations insight
  8. Symantec – Successfully switched from Solr to Elasticsearch with Elastic support
  9. SunHotels – Scaling anomaly detection across 1000+ bookings a day with Elastic machine learning
  10. BBC – Unlocking yesterday’s content for the future of media search

As a software development company, TatvaSoft has worked over the years on various projects involving big data analytics services and consultancy for clients across industries. We even delivered a project for the media and entertainment industry that used Elasticsearch to boost its search capabilities.

To know more about the project performed – Digital Distribution Platform

Brief Look on Apache HBase
Modeled on Google’s Bigtable and in direct competition with it, Apache HBase is an open-source, non-relational, scalable, distributed database developed as part of the Apache Software Foundation’s Apache Hadoop project. It runs on top of HDFS (the Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop.

Apache HBase

Hadoop on its own can perform only batch processing, and data is accessed only sequentially, which means the entire dataset must be scanned even for the simplest of jobs. Reasons to use HBase on top of it include: data volume (petabyte-scale data), application type (variable schemas with somewhat different rows), hardware environment (running on top of HDFS with a larger number of nodes, 5 or more), no requirement for full RDBMS features (transactions, triggers, complex queries, complex joins, etc.), and quick access to data (whenever random, real-time access is required). In complex big data analysis systems, HBase and Hive, two important Hadoop-based technologies, can also be used in conjunction for further extended features and reduced complexity.

apache-hadoop-eco

Background

  • Apache HBase is a top-level Apache project. It was initiated by a company named Powerset to process large volumes of data and make them searchable through natural language.
  • Facebook implemented its messaging platform on top of Apache HBase.
  • As of February 2017, the 1.2.x series is considered the stable release line.

Data can be stored in HDFS either directly or via HBase. Data consumers read or access data in HDFS randomly through HBase, which sits on top of the Hadoop File System and provides both read and write access.

[Figure: HBase on top of the Hadoop File System]

HBase vs HDFS

[Figure: HBase vs HDFS comparison]

Storage Mechanism

HBase is a column-oriented database, and its tables are sorted by row key. A table schema defines only column families, which hold the key-value pairs. A table has several column families, and each column family can contain multiple columns. Successive column values are stored contiguously on disk, and each cell value in the table carries a timestamp. The hierarchy is as follows (a shell sketch after the list makes it concrete):

  • Table is a collection of rows
  • Row is a collection of column families
  • Column family is a collection of columns
  • Column is a collection of key value pairs
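
To make the hierarchy concrete, here is a brief HBase shell sketch (the table, family, and value names are illustrative and mirror the Java examples later in this post); each cell is addressed as family:qualifier within a row:

hbase> create 'user', 'personal', 'professional'
hbase> put 'user', 'row1', 'personal:name', 'john'
hbase> put 'user', 'row1', 'professional:designation', 'APM'
hbase> get 'user', 'row1'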

Features:

  • Linearly scalable
  • Automatic failover support
  • Consistent reads and writes
  • Integrates with Hadoop, both as a source and a destination
  • Provides an easy Java API for clients
  • Data replication across clusters

Architecture:

In HBase, tables are split into smaller regions, which are served by region servers. Each region is further vertically partitioned by column family into parts commonly known as Stores, and each Store is saved as a file in HDFS. The diagram below shows the architecture of HBase:

[Figure: HBase architecture]

Note: The term 'store' describes the storage structure within a region.

Setting up Runtime Environment

The prerequisites for HBase are as follows:

  • Create a separate Hadoop user (recommended)
  • Set up SSH
  • Java
  • Hadoop
  • Configure Hadoop:
    • core-site.xml – add the host and port (HDFS URL), the total memory allocated to the file system, and the size of the read/write buffer
    • hdfs-site.xml – should contain values for data replication, the namenode path, the datanode path, etc.
    • yarn-site.xml – used to configure YARN in Hadoop
    • mapred-site.xml – used to specify which MapReduce framework to use
  • Install HBase (see the sample hbase-site.xml below)

We recommend using HDP (Hortonworks Data Platform) for learning purposes, as it comes with all the prerequisites already installed. We are using HDP v2.6 for this demo.
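
When installing HBase manually instead, the key file to edit is hbase-site.xml. A minimal sketch for a setup like the one above (the HDFS URL and host name are illustrative assumptions, not values from the original post):

<configuration>
  <!-- where HBase stores its data on HDFS -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <!-- the ZooKeeper quorum used by clients and region servers -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
</configuration>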

Java API

A Java API is provided for communication with HBase, and it is generally the fastest way to work with it. All DDL operations are facilitated mainly by the HBaseAdmin class. Sample code to obtain an HBaseAdmin instance is shown below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "<server_ip>:2181");
conf.set("zookeeper.znode.parent", "/hbase-unsecure");
HBaseAdmin admin = new HBaseAdmin(conf);

Note: In this demo, HDP is running at IP 192.168.23.101 with ZooKeeper on port 2181, and we are connecting to it from the local system.

DDL Commands

Create Table:

HTableDescriptor tableDescriptor = new HTableDescriptor("user");
// add column families (the same ones used by the DML examples below) to the table descriptor
tableDescriptor.addFamily(new HColumnDescriptor("personal"));
tableDescriptor.addFamily(new HColumnDescriptor("professional"));
// create the table using the admin object
admin.createTable(tableDescriptor);

Alter Table:

HColumnDescriptor columnDescriptor = new HColumnDescriptor("emailId");
// add the new column family to the existing 'user' table
admin.addColumn("user", columnDescriptor);

Disable Table:

admin.disableTable("user");

Delete Table:

admin.deleteTable("user");

List Table:

String[] tableNames = admin.getTableNames();

DML Commands

DML operations require the HTable class; the following snippet shows how it is instantiated. Don't forget to close the HTable when you are finished.

// instantiating the HTable class
HTable hTable = new HTable(conf, "user");
// ... perform DML operations ...
// close the table when finished
hTable.close();

Insert/Update Data:

// instantiating Put class
Put put = new Put(Bytes.toBytes("myRow"));
 
// adding/updating values using add() method
put.add(Bytes.toBytes("personal"),
Bytes.toBytes("name"),Bytes.toBytes("john"));
 
put.add(Bytes.toBytes("personal"),
Bytes.toBytes("city"),Bytes.toBytes("Boston"));
 
put.add(Bytes.toBytes("professional"),Bytes.toBytes("designation"),
Bytes.toBytes("APM"));
 
put.add(Bytes.toBytes("professional"),Bytes.toBytes("salary"),
Bytes.toBytes("50000"));
 
// saving the put Instance to the HTable
hTable.put(put);

Read Data:

Get get = new Get(Bytes.toBytes("myRow"));
// fetching the data
Result result = hTable.get(get);
// reading the values
String name = Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name")));
String city = Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("city")));

Delete Data:

Delete delete = new Delete(Bytes.toBytes("myRow"));
delete.deleteColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));
delete.deleteFamily(Bytes.toBytes("professional"));
 
// deleting the data
hTable.delete(delete);
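
The snippets above cover only single-row reads and deletes. As a minimal sketch under the same HBase 1.x client API, scanning multiple rows looks like this (the column family and qualifier follow the insert example above):

// instantiating Scan and restricting it to one column
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));

// iterating over all matching rows
ResultScanner scanner = hTable.getScanner(scan);
for (Result res : scanner) {
    System.out.println(Bytes.toString(
        res.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))));
}
scanner.close();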

Security:

  • Grant Permission: The grant command gives specific rights, such as read, write, execute, and admin, on a table to an authorized user. Its syntax is as follows:
    grant <user> <permissions> [<table> [<column family> [<column qualifier>]]]
  • Revoke Permission: The revoke command withdraws a user's access rights to a table. Its syntax is as follows:
    revoke <user>
  • Check Permission: The user_permission command lists all the permissions granted on a particular table, as the example below shows.
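
As a hedged illustration (the user name john is hypothetical), these commands would run from the HBase shell against the user table created earlier; the permission letters R, W, X, C, and A stand for read, write, execute, create, and admin:

hbase> grant 'john', 'RW', 'user'
hbase> user_permission 'user'
hbase> revoke 'john'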

Why Choose HBase?

From a business perspective, before accepting Apache HBase as a prominent solution for retrieving and synchronizing data, it is wise to weigh its benefits and drawbacks against the alternatives.

Pros

  • Built-in versioning
  • Strong consistency at the record level
  • Provides RDBMS-like triggers and stored procedures through co-processors
  • Built on tried-and-true Hadoop technologies
  • Active development community

Cons

  • Lacks a friendly, SQL-like query language
  • Setting up anything beyond a single-node development cluster is not easy

Beyond these, scalability, sharding, distributed storage, consistency, failover support, API support, MapReduce support, backup support, and real-time processing are the core features that set HBase apart. In a nutshell, it can truly revolutionize an existing system's ability to synchronize structured and unstructured data.

The post Brief Look on Apache HBase appeared first on TatvaSoft Blog.

]]>
https://www.tatvasoft.com/blog/brief-look-apache-hbase/feed/ 0
Leveraging Big Data Analytics to Revolutionize Sports https://www.tatvasoft.com/blog/leveraging-big-data-analytics-revolutionize-sports/ https://www.tatvasoft.com/blog/leveraging-big-data-analytics-revolutionize-sports/#respond Mon, 17 Oct 2016 07:19:41 +0000 https://www.tatvasoft.com/blog/?p=2000 Sports is all about players, teams, practice, performance, and strategies. Today, Big Data has touched several industries, and its emerging trends help any industry make the right decisions. The sports industry is also leveraging big data to build strategies and improve the game at all levels.

The post Leveraging Big Data Analytics to Revolutionize Sports appeared first on TatvaSoft Blog.

]]>
Sports is all about players, teams, practice, performance, and strategies. Today, Big Data has touched several industries, and its emerging trends help any industry make the right decisions. The sports industry is also leveraging big data to build strategies and improve the game at all levels. In our earlier blog we already discussed Big Data's competitive advantage in retail; today we will focus on how big data has become a trending buzzword in sports.

Statistics show that, in America alone, 80% of NBA teams and 97% of MLB teams employ analytics professionals.


Since the inception of big data in sports, data scientists have been using this data-driven technology to gather large volumes of data, analyze them, and build game strategies. Thanks to the vast data collected from multiple channels, the sports industry has reached new heights in improving players' efficiency and performance.

Big Data is everywhere. Wearable devices that track players' health, calorie-intake data that shapes playing strategies, and mobile apps that track fans' experience are just a few examples of big data in sports.

How Has Big Data Made Sports Simpler?

Wearable Technologies Data

Wearable devices have flooded the market, raising fitness awareness among individuals. They have also found acceptance in the sports industry, where they help monitor and track player data. The device is either worn by the player or attached to their clothing, letting the coach know about the player's performance and fitness variables such as heart rate, speed, and acceleration.

This collected data gives coaches vital information about players, from which an ideal training plan can be created for each player. Data collection improves player safety and makes sports safer overall.

Live Data Collection

Manual data collection on the field is common during games. Collecting live data, however, is challenging because the action moves so fast that an exact record cannot be captured at the precise moment. Hence, companies attach RFID tags to players and sports equipment to track speed, distance, and time.

Support Better Judgement

Since everything in sports happens fast, referees face a challenge in making correct decisions. A wrong call can impact the entire game, so with big data and analytics the sports industry can use devices to track a strike or a ball hit. This helps sports authorities make better decisions backed by analytical data.

Predicting a Fan Preference

Technology is now everywhere, and ticket vendors use it to understand what fans want. Here, big data drives predictive analysis that lets vendors know which times suit sports fans or whether they are interested in visiting a particular stadium. Major sports teams use mobile apps to predict their fans' needs and provide a special fan experience.

Training to Improve Game

Golf is a highly sophisticated game that demands good precision. To help with training and improving the game, a golf simulator system combines hardware (sensors and tracking devices) with software that captures, displays, and analyzes data based on the golfer's movements. From the data received, the software generates an animation along with a synthesized biomechanical report that helps identify a player's strengths and weaknesses and suggests how to play to improve the game.

Influence Coach Decision

The rapid increase in available tools optimizes coaching decisions by tracking players' real-time performance. Big Data widens opportunities in this competitive field by turning managed data into the right decision. For example, a coach standing outside the court can study the players' playing order and health using wearable devices that send real-time data, helping the coach make the right call.

How are Business Goals Met with Big Data?

Having seen several aspects of Big Data in sports, how are business goals met? The foremost objective is to classify each player's performance as high or low. This gives insight into players as high, average, or low performers, which feeds into training for success, coaching decisions, and plan execution. Industries have been utilizing big data in exactly this way to define strategies; in sports, the data is put to use through a proper tool designed by a professional Big Data and Analytics company.

Big Data is an opportunity for sports to enhance players' professionalism and define an unbeatable strategy that meets expectations!

The post Leveraging Big Data Analytics to Revolutionize Sports appeared first on TatvaSoft Blog.

]]>
https://www.tatvasoft.com/blog/leveraging-big-data-analytics-revolutionize-sports/feed/ 0
Metadata Injection Using Pentaho https://www.tatvasoft.com/blog/metadata-injection-using-pentaho/ https://www.tatvasoft.com/blog/metadata-injection-using-pentaho/#respond Thu, 21 Jul 2016 12:39:14 +0000 https://www.tatvasoft.com/blog/?p=1984 Metadata injection enables users to define metadata at run time, e.g., mapping Excel columns to fields at run time based on various parameters.

The post Metadata Injection Using Pentaho appeared first on TatvaSoft Blog.

]]>
Metadata injection enables users to define metadata at run time, e.g., mapping Excel columns to fields at run time based on various parameters.

Pentaho's most popular tool, Pentaho Data Integration, or PDI (aka Kettle), gives us a step, ETL Metadata Injection, which is capable of inserting metadata into a template transformation. So instead of statically entering ETL metadata in a step dialog, you can pass it dynamically. This feature plays an instrumental role in removing repetitive ETL workloads such as loading text files, data migration, and so on. (Please refer to our earlier blog for more details about the ETL process.)

Metadata injection inserts data from various sources into your transformation at runtime. This insertion reduces repetitive ETL tasks across various input and output files.

For example, you might have a simple transformation to load transaction data values from a supplier’s spreadsheet, filter out specific values to examine, and output them to a text file.

You need to develop a transformation for the main repetitive process, often known as the template transformation.

[Figure: ETL Metadata Injection overview]

For this example, you need a transformation (process_supplier_file) to process the transactions in each supplier's file. The metadata is then injected from a transformation (inject_supplier_metadata) built with the ETL Metadata Injection step, which calls the template transformation. Since this example inserts data from multiple files, the metadata injection transformation needs to be called from another transformation (process_all_suppliers) once per supplier file.

So overall, we will develop three transformations.

Template Transformation

Template Transformation – the main repetitive transformation for processing the data in each supplier's spreadsheet.

With metadata injection, you develop your repetitive template transformation as you normally would. The main difference is that the settings of each step pertain to the metadata injection rather than to the data values of a single specific source.

Process_supplier_file:

[Figure: process_supplier_file template transformation]

Metadata Injection Transformation

Metadata Injection Transformation – the transformation that defines the structure of the metadata and how it is injected into the main transformation.

For this example, our metadata values are maintained in separate spreadsheet files. You need to create a transformation to extract these values, prepare them for injection, and then insert them into the template transformation through the ETL Metadata Injection step, as shown in the following figure:

Inject_supplier_metadata:

[Figure: inject_supplier_metadata transformation with the ETL Metadata Injection step]

Transformation for All Suppliers

Transformation for All Suppliers – the transformation that goes through all the suppliers' spreadsheets, calls the metadata injection transformation for each supplier, and logs the entire process.

Since we have multiple input sources, we need a transformation that runs through each source and injects the metadata. Each input source is specified through a variable in a Transformation Executor step, which calls the metadata injection transformation.

Process_all_suppliers:

This is a simplified example, storing the data in a text file for illustration. The same pattern applies to all sorts of use cases and can store the data in SQL databases, NoSQL databases, or Big Data stores.

[Figure: process_all_suppliers transformation]

The post Metadata Injection Using Pentaho appeared first on TatvaSoft Blog.

]]>
https://www.tatvasoft.com/blog/metadata-injection-using-pentaho/feed/ 0
Big Data Empowers Retailers with Competitive Advantages https://www.tatvasoft.com/blog/big-data-analytics-retailers-competitive-advantages/ https://www.tatvasoft.com/blog/big-data-analytics-retailers-competitive-advantages/#respond Wed, 15 Jun 2016 07:20:10 +0000 https://www.tatvasoft.com/blog/?p=2001 In today's highly competitive world, data access and usage have become crucial for every business to understand customers and make decisions wisely. Customers demand a seamless experience across all the channels they shop in, from the initial search to completing a transaction.

The post Big Data Empowers Retailers with Competitive Advantages appeared first on TatvaSoft Blog.

]]>
In today's highly competitive world, data access and usage have become crucial for every business to understand customers and make decisions wisely. Customers demand a seamless experience across all the channels they shop in, from the initial search to completing a transaction. From an organization's point of view, it is critical to connect every dot to gain customer insights such as products in demand, shopping preferences, shopping experience, geographic location, and currency. With the help of all this information, an important concept has grabbed retailers' attention: "Big Data".

Big Data does not mean only the data of a customer or a product; it means all types of data from multiple sources. Today, customers can easily find their product at the best price by comparing online. If they do not get a personalized experience, chances are they will not make a repeat purchase. Retailers who lack these insights drift away from the latest market trends and product prices.

Big Data Opportunity for Retail Business

Although retailers collect data from customers, they have not been able to use it effectively. The Big Data trend has now shifted retail toward a customer-centric model. Using data to understand customer behavior enables targeted marketing campaigns. Every business wants to connect with customers irrespective of their location and give personalized service, but doing so poses a competitive challenge. Therefore, collecting multiple chunks of data helps determine customer behavior.

A statistical report by Thorsen in Science Daily says that 90% of the world's data has been collected in the last two years. Data from social, mobile, and local channels is generated in an unstructured form that cannot be managed by conventional business solutions and data warehouses.

This gives rise to the need for Big Data Analytics solutions, through which businesses can leverage the available data to drive marketing campaigns and revenue and run their operations effectively. It also creates an opportunity to understand how customers connect with the business.

Leveraging Big Data for Retail

Historical data gives general information about customers, but not enough detail; business demands granular data to understand individual behavior. To make technology easier for customers so they gain a seamless experience, retailers leverage this new trend, shifting their cultural habits and gaining deeper insight, as described below:

  • Fuzzy Logic:

    This concept denotes identifying the relationships between different data elements. For example, a customer is sure about their product requirement but has not been exposed to the available options, while the retailer does not carry the exact product the customer wants but has something similar. Fuzzy logic applies here: the retailer offers the closest match to the customer's search to earn their trust, and the approach works when both parties agree on the closest available option.

  • Stock Prediction:

    Earlier, predicting stock was a simple process because data elements were limited and shopping revolved around a few seasons or occasions. Big Data offers limitless opportunity to predict stock ahead of several variables such as season, weather, trends, and much more. Retailers can now focus on selling the product rather than analyzing stock.

  • Improved Shopping Concept:

    Knowing what the majority of customers want, or why they abandon their shopping carts, is insight retailers badly need. Big Data analysis is an intelligent way to understand customers' shopping behavior and redefine the shopping process. By leveraging this technology, a retailer gains access to a customer's interests and the stock location geographically closest to them, and can suggest a better deal. If the customer buys the product, the retailer wins.

  • Customer Rewards:

    Big Data allows retailers to keep track of their returning customers, who are a valuable resource for the business. To please customers and keep them attached to the brand, companies offer loyalty rewards to each loyal customer. Analytics plays a vital role in loyalty programs by studying customer behavior and shopping patterns, and it has become an effective way to shape CRM strategies.

  • Fraud Detection and Prevention:

    When a business moves online, fraud detection becomes a major concern; as transactions go online, sophisticated fraudulent activities emerge. Analytics gathers all the unstructured data and analyzes it to identify mismatched patterns at an early stage.

Big Data May Fail When…

  • Lack of Business Objectives:

    Big Data is a trend now, but despite its worldwide acceptance it fails for some businesses, especially when a company invests in an analytics tool without a clear vision of how to utilize the data. Without identifying the problem and how Big Data can help solve it, the solution will remain undefined. It is advisable to prioritize business objectives and identify the complex problem that needs to be solved.

  • Lack of Skills:

    Big Data fails if the right minds are not involved in questioning the data. Finding talent that truly understands big data concepts is a big challenge, so retailers have to work hard to find the right people for their analysis.

  • Lack of Domain Expertise:

    Big data can only be utilized if its core domain is understood. Even after hiring experienced data experts, a lack of domain knowledge leads to failure. It is therefore always better to hire data scientists with domain expertise.

How Does Big Data Benefit Retail Business?

Retailers have now become more data-driven, so they should aggregate their data to improve performance and meet customer demand. Big Data expands opportunities to target potential customers effectively and to identify the most valuable among them. This data repository fuels predictive analysis and widens the scope for improvement. Big Data is moving into all industries, and to extract the right information for business success, an analytics solution can help gain customer insights and operate efficiently. Partnering with the right custom software development company, like TatvaSoft, with expertise in Big Data & Analytics solutions, is critical to converting ideas into working solutions.

The post Big Data Empowers Retailers with Competitive Advantages appeared first on TatvaSoft Blog.

]]>
https://www.tatvasoft.com/blog/big-data-analytics-retailers-competitive-advantages/feed/ 0
Evolving Marketing Ecosystem Around Big Data https://www.tatvasoft.com/blog/how-can-organization-growth-be-influenced-by-big-data/ https://www.tatvasoft.com/blog/how-can-organization-growth-be-influenced-by-big-data/#respond Mon, 30 May 2016 07:26:08 +0000 https://www.tatvasoft.com/blog/?p=2009 Big Data is driving the world and forcing organizations to take a strategic approach to analyzing data for better-informed decisions. Big Data means bigger opportunities and bigger challenges. It is mainly characterized by the volume, variety, variability, and complexity of the information.

The post Evolving Marketing Ecosystem Around Big Data appeared first on TatvaSoft Blog.

]]>
Big Data is driving the world and forcing organizations to take a strategic approach to analyzing data for better-informed decisions. Big Data means bigger opportunities and bigger challenges. It is mainly characterized by the volume, variety, variability, and complexity of the information. However, volume is not the focus area; what matters is utilizing the data to answer customer queries by being more data savvy.

In today's digital age, Big Data is a buzzword in every organization, and marketers are making considerable use of it to improve customer experience, boost customer interaction, increase revenue, reduce costs, and engage with customers in many different ways.

Below are a few points on how the marketing ecosystem is evolving:

Big Data Statistics and Future Predictions

  • 2.5 quintillion bytes of data are produced every day
  • Walmart handles more than 1 million customer transactions every hour
  • More than 5 billion people use mobile phones globally
  • The Obama Administration committed a $200 million investment to Big Data research projects

According to Oracle’s predictions:

  • Technologies like Artificial Intelligence will be applied to data challenges
  • Companies will classify their file systems into various categories, spanning fields such as academia, politics, and journalism
  • Companies will move to hybrid cloud deployments to save costs and drive regulatory compliance

Big Data Benefits to Marketers


Targeted Marketing Opportunities

Big Data can be analyzed to check for new services launched in the market, their benefits, and other opportunities that still exist in the market. Salespeople can adopt this technique to find valuable prospects and expand existing opportunities.

Optimizing Customer Engagement

Customers demand more information before engaging with a brand. Big Data can be used to build a more precise picture of customers by interacting with them to learn who they are, what they like or dislike, and the exact time and place to contact them. This allows brands to know their customers, address their requests quickly, and maintain better control of the relationship.

Gain Competitive Advantage

Big Data can be used to learn more about established competitors and new entrants in the industry, allowing each company to innovate, capture and create value, and outperform its competitors.

Marketing Performance

With the help of Big Data, companies can measure their spending and stay focused on the target. They can create marketing campaigns that keep their audience more engaged.

Big Data Challenges

Infrastructure

Because Big Data includes both structured and unstructured data collected from different sources, one must be able to understand and sort out the useful data that needs to be analyzed and stored. That also requires considerable technical knowledge and disk space.

Security and Privacy

The privacy of individual data, and how to protect it, is an increasingly important issue in the digital world. Companies face difficulties in identifying the right data.

Lack of Skills

Skilled Big Data analytics workers are becoming hard to find. Some are relatively new to the field, and a few hold very senior positions in a handful of organizations. It is difficult for companies to identify the right people who can put the data to work with the latest technology trends.

Using Big Data for Marketing

Big Data for Deeper Insights

Big Data offers a great opportunity to dig deeper into the data until better insights emerge. Once insights are obtained, they can be analyzed and explored further each time. This level of insight supports growth and produces better outcomes.

Appointment of Data Scientist

It becomes necessary to appoint a data scientist who has sound knowledge and is capable of making decisions with the help of data. They should be technically adept, able to analyze data coming from different sources, and able to identify the data best suited for the organization.

Educate about the Importance of Big Data

It is very important for the data scientist to educate the members of the organization, especially the marketing team, about Big Data. The goal should be to make people understand how to use it to make better decisions.

Conclusion

Data can become a secret weapon in this highly competitive world. With the growth of Big Data and Analytics solutions, marketers have ready access to data about their customers' needs, which helps them offer meaningful and valuable solutions.

The post Evolving Marketing Ecosystem Around Big Data appeared first on TatvaSoft Blog.

]]>
https://www.tatvasoft.com/blog/how-can-organization-growth-be-influenced-by-big-data/feed/ 0
Data Mining with Weka https://www.tatvasoft.com/blog/data-mining-with-weka/ https://www.tatvasoft.com/blog/data-mining-with-weka/#respond Sun, 31 Jan 2016 22:21:43 +0000 https://www.tatvasoft.com/blog/?p=2040 In today's world of data explosion, managing information becomes extremely difficult and can lead to overload and chaos. Luckily, there are tools, technologies, and methodologies that help manage the abundant data and extract valuable insights from it.

The post Data Mining with Weka appeared first on TatvaSoft Blog.

]]>
In today's world of data explosion, managing information becomes extremely difficult and can lead to overload and chaos. Luckily, there are tools, technologies, and methodologies that help manage the abundant data and extract valuable insights from it. One of the most important methodologies is Data Mining, and one such tool is Weka. Before we learn more about Weka, let's first talk about what Data Mining is.

What is Data Mining?

Data Mining is a key process in analyzing Big Data. It is the computational process of uncovering patterns in large raw datasets, which supports making the right business decisions and designing strategies for organizational growth.

Raw data in the data mining process can be anything from the list below:

  • CSV files (comma separated values)
  • Data warehouse
  • CRM
  • Transactional Data
  • Text, flat files, etc.

The data mining process is mainly applied to a data warehouse (a collection of large amounts of data), using queries to generate results. The process identifies relationships in the input data, analyzes patterns, and extracts information, which is then transformed into user-understandable formats such as dashboards, tables, charts, and reports.

The resultant data is called "information"; hence, knowledge discovery is the main aim of the data mining process.

Data Mining Process

Let's try to understand the data mining process with the example of Market Basket Analysis. In a grocery shop or a mall, the best discount offer to give a customer is decided by analyzing which products are most often bought together. For example, a customer who buys milk will most likely also buy bread.

Having understood the methodology, let's now talk about one of the best tools for data mining: Weka.

What is Weka?


Weka is a data mining and visualization tool that contains a collection of machine learning algorithms for data mining tasks. It is open-source software issued under the GNU General Public License, and it presents results in the form of charts, trees, tables, and so on.

Weka expects the data file to be in the Attribute-Relation File Format (ARFF), so any other file must first be converted to ARFF before we start mining it in Weka.
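
As a minimal sketch, the weather dataset referenced below looks roughly like this in ARFF (the data rows shown are illustrative; the copy shipped with Weka may differ):

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes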

Features of Weka

Data Preprocessing: Cleaning the data during the gathering and selection phase. It removes or fills in default values for missing fields and resolves conflicts.

Data Classification and Prediction: Classifies data based on relationships between items and predicts a data label. For example, a bank, based on its available loan data, classifies customers and predicts the label 'risky' or 'safe'.

Clustering: Groups related data into clusters to discover distinct groups. For example, given weather data, we may want to decide whether to play outside; with Weka we can visualize the overall data and make the decision according to the charts.

In the Weka Explorer, the data is loaded from the weather.arff file, which has 5 attributes: outlook, temperature, humidity, windy, and play. With temperature as the selected attribute, we want to decide whether to play outside. Weka mines the available data and produces a result displayed as a chart (blue = play outside, red = do not play). The chart visualizes the play attribute with respect to temperature; per this data, if the temperature is between 64 and 75, play outside.

Weka also provides various data mining techniques such as filters, classification, and clustering. Here is another example of a data mining technique: classification using the J48 algorithm.


Figure: Classification Algorithm

The figure shows the result of the J48 classification algorithm in Weka, which displays the model as a tree. By visualizing the tree, one can decide whether to play outside or not.
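
For readers who prefer code over the Explorer GUI, Weka exposes the same algorithm through its Java API. A minimal sketch, assuming weka.jar is on the classpath and weather.arff is in the working directory:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        // load the ARFF file and mark the last attribute (play) as the class
        Instances data = new DataSource("weather.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // build the J48 decision tree and print its textual form
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}

Printing the classifier yields the same decision tree that the Explorer renders graphically.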

The post Data Mining with Weka appeared first on TatvaSoft Blog.

]]>
https://www.tatvasoft.com/blog/data-mining-with-weka/feed/ 0