FOUNDATION OF BUSINESS INTELLIGENCE: DATABASES AND INFORMATION MANAGEMENT
ORGANIZING DATA IN A TRADITIONAL FILE ENVIRONMENT
File organization concepts
• Record: Group of related fields
• File: Group of related files
• Field: Group of characters as word(s) or number
- Describes an entity (person, place, thing on which we store information
- Attribute: Each characteristic, or quality, describing entity
- Example: Attributes Date or Grade belong to entity Course
The Data Hierarchy
A computer system organizes data in
a hierarchy that starts with the bit, which represents either a 0 or a 1. Bits
can be grouped to form a byte to represent one character, number, or symbol.
Bytes can be grouped to form a field, and related fields can be grouped to form
a record. Related records can be collected to form a file, and related files
can be organized into a database.
Problems with the traditional file processing (filesmaintained separately by different departments)
- Data redundancy: Presence
of duplicate data in multiple files. - Data inconsistency: Same
attribute has different values
- Program-data dependence:
•When
changes in program requires changes to data accessed by program
- Lack of flexibility
- Poor security
- Lack of data sharing and availability
THE DATABASE APPROACH TO DATA MANAGEMENT
•Database:
- Collection
of data organized to serve many applications by centralizing data and controlling redundant data
•Database
management system:
- Interfaces
between application programs and physical data files
- Separates
logical and physical views of data
- Solves
problems of traditional file environment
i) Controls
redundancy
ii) Eliminated
inconsistency
iii) Uncouples
programs and data
iv) Enables
central management and security
•Relational
DBMS
- Represent
data as two-dimensional tables called relations or files
- Each
table contains data on entity and attributes
•Table:
Grid of columns and rows
- Rows (tuples):
Records for different entities
- Fields (columns):
Represents attribute for entity
- Key field:
Field used to uniquely identify each record
- Primary key:
Field in table used for key fields
- Foreign key:
Primary key used in second table as look-up field to identify records from
original table
•Operations
of a Relational DBMS: Three
basic operations used to develop useful sets of data
i) SELECT:
Creates subset of data of all records that meet stated criteria
ii) JOIN:
Combines relational tables to provide user with more information than available
in individual tables
iii) PROJECT:
Creates subset of columns in table, creating tables with only the information
specified
•Capabilities
of Database Management Systems
- Data definition capability: Specifies
structure of database content,used to create tables and define characteristics of fields.
- Data dictionary: Automated
or manual file storing definitions of data elements and their
characteristics.
- Data manipulation language: Used
to add, change, delete, retrieve data from database
•Structured
Query Language (SQL)
•Microsoft
Access user tools for generation SQL
- Many DBMS have report
generation capabilities for
creating polished reports
(Crystal Reports)
•Designing
Databases
- Conceptual
(logical) design: abstract model from business perspective
- Physical
design: How database is arranged on direct-access storage devices
•Design
process identifies
- Relationships
among data elements, redundant database elements
- Most
efficient way to group data elements to meet business requirements,
needs of
application programs
•Normalization
- Streamlining
complex groupings of data to minimize redundant data elements
and awkward
many-to-many relationships
•Business intelligence infrastructure
- Today includes an array of tools for separate systems, and big data
• Contemporary tools-
i) Data warehouses:
•Stores
current and historical data from many core operational transaction systems
•Consolidates
and standardizes information for use across enterprise, but data
cannot be
altered
•Data
warehouse system will provide query, analysis, and reporting tools
ii) Data marts:
•Subset
of data warehouse with summarized or highly focused portion of firm’s data for
use by specific population of users
•Typically
focuses on single subject or line of business
iii) Hadoop:
• Enables distributed parallel processing of big data across inexpensive computers.
• Key services
- Hadoop distributed file system (HDFS): data storage
- Mapreduce: breaks data into clusters for work
- Hbase: NoSQL database
• Used by Facebook, Yahoo, NexBio
iv) In-memory computing:
• Used computers main memory (RAM) for data storage to avoid delays in retrieving data from disk storage.
• Used in big data analysis.
• Can reduce hours/days of processing to seconds
• Requires optimized hardware.
v) Analytical platforms:
- High-speed platforms using both relational and non-relational tools optimized for large datasets.
•Analytical tools: Relationships, patterns, trends
- Tools for consolidating, analyzing, and providing access to vast amounts of data to help
users make better business decisions.
i) Multidimensional data analysis (OLAP):
- Online analytical processing (OLAP)
•Analytical tools: Relationships, patterns, trends
- Tools for consolidating, analyzing, and providing access to vast amounts of data to help
users make better business decisions.
i) Multidimensional data analysis (OLAP):
- Online analytical processing (OLAP)
•Supports
multidimensional data analysis
•Enables
viewing data using multiple dimensions
•Each
aspect of information (product, pricing, cost, region, time period) is
different dimension
•Example: How many washers sold in East in June compared with others
regions?
- OLAP
enables rapid, online answers to ad hoc queries
ii) Data Mining:
-Finds
hidden patterns, relationships in large databases
-Infers
rules to predict future behavior
-The patterns and rules are used to
guide decision making and forecast the
effect of those decisions
-Popularly used to provide detailed
analyses of patterns in customer data for
one-to-one marketing campaigns or to
identify profitable customers.
-Less well known: used to trace
calls from specific neighborhoods that use
stolen cell phones and phone
accounts.
- Types of information obtainable from data mining:
•Associations:
Occurrences
linked to single event
•Sequences:
Events
linked over time
•Classification:
Recognizes
patterns that describe group to which item
belongs
•Clustering:
Similar
to classification when no groups have been defined;
finds groupings within data
•Forecasting:
Uses
series of existing values to forecast what other
values will be
iii) Text Mining:
- Extracts key elements from large unstructured data sets.
• Stored e-mails
• Call center transcripts
• Legal cases
• Patent descriptions
• Service reports, and so on
- Sentiment analysis software
• Mines e-mails, blogs, social media to detect opinions.
iv) Web Mining:
- Discovery and analysis of useful patterns and information from web.
• Understand customers behavior.
• Evaluate effectiveness of web site, and so on.
- Web content mining.
• Mines content of Web page
- Web structure mining
• Analyzes links to and from web page
- Web usage mining
• Mines user interaction data recorded by web serve.
v) Databases and the Web
- Many
companies use Web to make some internal databases available to customers or
partners
- Typical
configuration includes:
•Web
server
•Application
server/middleware/CGI scripts
•Database
server (hosting DBM)
- Advantages
of using Web for database access:
•Ease
of use of browser software
•Web
interface requires few or no changes to database
•Inexpensive
to add Web interface to system
No comments:
Post a Comment