
Practical Data Warehousing: Successful Cases


No matter how smooth a plan looks in theory, practice will make its own adjustments, because each real case has characteristics that no general scheme can fully anticipate. Let's see how the world's leading brands have adapted a well-known way of storing information, data warehousing, to their needs. If you think this is your case, then arrange a call.

Global Data Warehousing Market By Application

The Reason for Making Decisions

The need to make business decisions based on data analysis has long been beyond doubt. But to get this data, it needs to be collected, sorted, and prepared for analytics.


This is what data warehousing specialists do. To understand what top performance looks like, it makes sense to examine how high-quality custom solutions have been assembled from this construction set.

Data warehousing interacts with a huge amount of data

A data warehouse is a digital storage system that integrates and reconciles large amounts of data from different sources. It helps companies turn data into valuable information and make informed decisions based on it. A data warehouse combines current and historical data and acts as a single source of reliable information for the business.

Raw data enters the warehouse through an extract, transform, load (ETL) process from operational systems, such as an enterprise resource planning (ERP) system or a customer relationship management (CRM) system. Sources also include databases, partner operational systems, IoT devices, weather apps, and social media. The infrastructure can be on-premises or cloud-based, with the latter option predominating in recent years.
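The ETL flow described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual pipeline; the source fields and the `sales_fact` table name are invented for the example.

```python
# Minimal ETL sketch: extract records from a hypothetical CRM export,
# transform them into one consistent schema, and load them into a
# warehouse table. All source and field names are illustrative.

def extract(crm_rows):
    """Pull raw rows from an operational source (here, an in-memory stand-in)."""
    return list(crm_rows)

def transform(rows):
    """Normalize field names and types so every source lands in one schema."""
    out = []
    for r in rows:
        out.append({
            "customer_id": int(r["id"]),
            "country": r.get("country", "unknown").strip().upper(),
            "revenue": round(float(r.get("revenue", 0)), 2),
        })
    return out

def load(rows, warehouse):
    """Append normalized rows to the warehouse table."""
    warehouse.setdefault("sales_fact", []).extend(rows)
    return warehouse

warehouse = {}
raw = [{"id": "17", "country": " de ", "revenue": "199.90"}]
load(transform(extract(raw)), warehouse)
print(warehouse["sales_fact"][0])
# {'customer_id': 17, 'country': 'DE', 'revenue': 199.9}
```

In a real system the extract step would read from the operational database and the load step would write to the warehouse engine, but the shape of the flow is the same.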

Data warehousing is needed not only for storing information but also for processing structured and unstructured data: video, photos, sensor readings. Some data warehouse options use built-in analytics and in-memory database technology (data is held in RAM rather than on a hard drive) to provide access to reliable data in real time.

After the data is sorted, it is sent to data marts for further analysis by BI or data science teams.

Why consider data warehousing cases

Studying known data warehousing cases is worthwhile, first of all, to avoid repeating the same mistakes. Based on a working solution, you can improve your own performance. If you want to stay on the cutting edge of technology, book a call.

  • When using data warehouses, executives can access data from different sources, so they do not have to make decisions blindly.
  • Data warehousing enables quick retrieval and analysis: you can query large amounts of data without dedicating staff to the task.
  • Before data is loaded into the warehouse, the system creates data cleansing tasks and queues them for processing, converting the data into a consistent format for subsequent analyst reports.
  • The warehouse holds large amounts of historical data, letting you study past trends and issues to predict events and improve the business structure.
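The cleansing step mentioned above typically means deduplicating records and normalizing inconsistent formats before load. Here is one minimal, hedged sketch of that idea; the `order_id`/`order_date` fields and the set of accepted date formats are assumptions for illustration.

```python
from datetime import datetime

def cleanse(rows):
    """Deduplicate rows and normalize dates to ISO 8601 before warehouse load.

    Illustrative only: real cleansing jobs also validate types, fill or
    reject missing values, and log what they changed.
    """
    seen, clean = set(), []
    for r in rows:
        key = r["order_id"]
        if key in seen:
            continue  # drop exact duplicates by business key
        seen.add(key)
        # try several source date formats until one parses
        for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"):
            try:
                r["order_date"] = datetime.strptime(r["order_date"], fmt).date().isoformat()
                break
            except ValueError:
                pass
        clean.append(r)
    return clean

rows = [
    {"order_id": 1, "order_date": "31/12/2023"},
    {"order_id": 1, "order_date": "31/12/2023"},  # duplicate, dropped
]
print(cleanse(rows))
# [{'order_id': 1, 'order_date': '2023-12-31'}]
```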

Blindly copying other people's decisions is also impossible. Your case is unique and probably requires a custom approach; at best, well-known storage solutions can serve as a basis. You can do it yourself, or you can contact DATAFOREST specialists for professional services. We have positive experience and positive customer stories of building and operating data warehouses.

Data warehousing cases

Case 1: How the Amazon Service Does Data Warehousing

Amazon is one of the world's largest and most successful companies, with a diversified business spanning e-commerce, cloud computing, digital content, and more. As a company that generates vast amounts of data (and sells data warehousing services itself), Amazon needs to manage and analyze its data effectively.

Two main businesses

Amazon's data warehousing needs are driven by the company's vast and diverse data sources, which require sophisticated tools and technologies to manage and analyze effectively.

1. One of the main drivers of Amazon's business is its e-commerce platform, which allows customers to purchase a wide range of products through its website and mobile apps. Amazon's data warehousing needs in this area focus on collecting, storing, and analyzing data related to customer behavior, purchase history, and other metrics. This data is used to optimize Amazon's product recommendation engine, personalize the shopping experience for individual customers, and identify growth strategies.

2. Amazon's other primary business unit is Amazon Web Services (AWS), which offers managed cloud computing services to businesses and individuals. AWS generates significant amounts of data from its cloud infrastructure, including customer usage and performance data. To manage and analyze this data effectively, Amazon relies on data warehousing technologies like Amazon Redshift, which enables AWS to provide real-time analytics and insights to its customers.
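The kind of usage analytics described here boils down to aggregate SQL over a warehouse table. Redshift speaks PostgreSQL-flavored SQL over a cluster; as a self-contained stand-in, the sketch below uses Python's built-in `sqlite3`, and the `usage` schema is entirely hypothetical.

```python
import sqlite3

# Stand-in for warehouse analytics: aggregate per-customer service usage.
# sqlite3 substitutes for a real cluster warehouse; the schema is invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE usage(customer TEXT, service TEXT, hours REAL)")
con.executemany("INSERT INTO usage VALUES (?, ?, ?)", [
    ("acme", "ec2", 120.0),
    ("acme", "s3", 40.0),
    ("globex", "ec2", 75.5),
])

# Per-customer totals: the building block of usage dashboards and billing insight
rows = con.execute(
    "SELECT customer, SUM(hours) FROM usage GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)
# [('acme', 160.0), ('globex', 75.5)]
```

The same `GROUP BY` pattern runs unchanged on a real warehouse engine; only the connection and data volumes differ.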

3. Beyond these core businesses, Amazon also has significant data warehousing needs in digital content (e.g., video, music, and books). Amazon's advertising business relies on data analysis to identify key demographics and target ads more effectively to specific audiences.

By investing in data warehousing and analytics capabilities, Amazon can maintain its competitive edge and continue to grow and innovate in the years to come.

Do you want to streamline your data integration?

Obstacles on the Way to the Goal

Amazon faced several specific implementation details and challenges in its data warehousing efforts.

• The brand needed to integrate data from various sources into a centralized data warehouse. It required the development of custom data pipelines to collect and transform data into a standard format.

• Amazon's data warehousing needs are vast and constantly growing, requiring a scalable solution. The company built a distributed data warehouse architecture on technologies like Amazon Redshift, allowing petabyte-scale data storage and analysis.

• As a company that generates big data, Amazon needed to ensure that its data warehousing solution could provide real-time analytics and insights. Achieving this level of performance requires optimizing data storage, indexing, and querying processes.

• Amazon stores sensitive customer data in its warehouse, prioritizing data security. To protect against security threats, the brand implements various security measures, including encryption, access controls, and threat detection.

• Building and maintaining a data warehousing solution can be expensive. Amazon leverages cloud-based data warehousing solutions (Redshift) to minimize costs, which provide a cost-effective, pay-as-you-go pricing model.

Amazon's data warehousing implementation required careful planning, significant investment in technology and infrastructure, and ongoing optimization and maintenance to ensure high performance and reliability.

Change for the better

When Amazon considered all the needs, found the right tools, and implemented a successful data warehouse, the company got the following main business outcomes:

• Improved data-driven decision-making

• Better customer enablement

• Cost-effective operation

• Improved performance

• Competitive advantage

• Scalability

Amazon's data warehousing implementation has driven the company's growth and success. Not surprisingly, a data storage service provider understands data storage: in this case, the cobbler's children do have shoes.


Case 2: Data Warehousing Adventure with UPS

United Parcel Service (UPS) is an American parcel delivery and supply chain management company founded in 1907, with an annual revenue of 71 billion dollars and logistics services in more than 175 countries. In addition, the brand provides goods distribution, customs brokerage, postal, and consulting services. UPS processes approximately 300 million tracking requests daily. This was achieved, among other things, thanks to intelligent data warehousing.

One mile for $50 million

In 2013, UPS stated that it hosted the world's largest DB2 relational database, spread across two United States data centers, for its global operations. Over time, global operations grew, as did the amount of semi-structured data. The goal was to use the different forms of stored data to make better business decisions.

One of the fundamental problems was route optimization. According to an interview with the UPS CTO, saving one mile a day per driver could save 1.5 million gallons of fuel per year, or about $50 million in total.

However, the data was scattered: some of it lived in DB2, some in local repositories, and some in spreadsheets. UPS needed to solve the data infrastructure problem first and only then optimize the routes.

Four Letters "V"

A big data ecosystem efficiently handles the four "Vs": volume, velocity, variety, and veracity. UPS experimented with Hadoop clusters and integrated its storage and computing systems into this ecosystem. It upgraded its data warehousing and computing power to handle petabytes of data, one of UPS's most significant technological achievements.

The following Hadoop components were used:

• HDFS for storage

• MapReduce for fast processing

• Kafka for streaming

• Sqoop (SQL-to-Hadoop) for ingestion

• Hive & Pig for structured queries on unstructured data

• A monitoring system for data nodes and name nodes
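To make the MapReduce component concrete, here is the map, shuffle, reduce pattern in plain Python, counting package scans per hub. This is purely illustrative: UPS never disclosed its actual jobs, and the record format here is invented.

```python
from collections import defaultdict

# MapReduce-style sketch: count package scans per hub.
# Illustrates the map -> shuffle -> reduce pattern that Hadoop
# components like the ones listed above support at cluster scale.

def mapper(record):
    """Emit (key, 1) for each input record."""
    hub, _tracking_id = record.split(",")
    yield hub, 1

def reducer(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)

records = ["Louisville,1Z999", "Koeln,1Z998", "Louisville,1Z997"]

# shuffle phase: group mapped values by key
shuffled = defaultdict(list)
for rec in records:
    for k, v in mapper(rec):
        shuffled[k].append(v)

result = dict(reducer(k, vs) for k, vs in shuffled.items())
print(result)
# {'Louisville': 2, 'Koeln': 1}
```

On a real cluster the same logic runs in parallel across data nodes, with HDFS holding the input and the framework handling the shuffle.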

But that's partly speculation: due to confidentiality, UPS never disclosed the full set of tools and technologies used in its big data ecosystem.

Constellation of Orion

The result was ORION (On-Road Integrated Optimization and Navigation), a four-year route optimization project costing about one billion dollars a year. ORION fed the stored data into big data calculations, drawing analytics from more than 300 million data points to optimize thousands of routes per minute based on real-time information. Beyond the economic benefits, the ORION project cut approximately 100 million shipping miles and reduced carbon emissions by about 100,000 tons.


Case 3: 42 ERP Into One Data Warehouse

In general, specific data warehousing implementations are kept fairly secret: contracts often contain confidentiality and legitimate-interest clauses. Open examples of such work exist, but the vast majority sit in paid libraries, since the subject is relevant enough that people charge money for it. Therefore, "open" cases do appear, but with the brand name undisclosed.

Brand X needs help

A world leader in industrial pumps, valves, actuators, and controls needed help extracting data from disparate ERP systems. The company wanted to pull data from 42 ERP instances, standardize it into flat files, and collect all the information in one data warehouse. To complicate matters, the ERP systems came from different vendors (Oracle, SAP, BAAN, Microsoft, PRMS).

The client also wanted a core set of metrics and a central dashboard combining information from its locations worldwide. The project grew out of a surge in demand for corporate data. The company knew it needed a central repository for all data from its locations around the world: requests often came from the top down, and when an administrator needed access to the right data, extracting it caused logistical problems. So the project got started.

Are you interested in enhanced insights through data aggregation?

The Foundation Stone

The hired third-party development team drew up a roadmap, according to which ERP data was taken from 8 major databases and placed in a corporate data warehouse. This entailed integrating 5 Oracle ERP instances with 3 SAP ERP instances. Rapid Marts were also integrated into the Oracle ERP systems to speed the project's progress.

One of the main challenges was the lack of standardization of fields and operational data definitions across the ERP systems. To solve this problem, the contractor developed a data service tool that accesses the back end of each database and presents the information in a suitable form. Since then, the customer has known which fields to use and how to set them each time a new ERP instance is encountered. These data definition patterns were the project's foundation stone and completely changed how the customer's data is handled.
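The data-definition pattern described above can be sketched as a per-vendor field mapping: each ERP names the same business field differently, and a mapping table normalizes them into one warehouse schema. The vendor field names below (`CUST_NO`, `KUNNR`, etc.) are invented for illustration, not the project's actual definitions.

```python
# Sketch of normalizing fields from different ERP vendors into one schema.
# The per-vendor field names are hypothetical examples.

FIELD_MAPS = {
    "oracle": {"CUST_NO": "customer_id", "INV_AMT": "invoice_amount"},
    "sap":    {"KUNNR":   "customer_id", "NETWR":   "invoice_amount"},
}

def standardize(source, row):
    """Rename a source row's fields to the warehouse's standard names,
    dropping anything the mapping does not cover."""
    mapping = FIELD_MAPS[source]
    return {mapping[k]: v for k, v in row.items() if k in mapping}

print(standardize("sap", {"KUNNR": "0042", "NETWR": 1999.0}))
# {'customer_id': '0042', 'invoice_amount': 1999.0}
```

Adding a 43rd ERP instance then reduces to writing one more mapping entry rather than rebuilding the pipeline.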

All roads lead to data warehousing

The company now has one common, consistent way to obtain critical indicators. The long-term effect of the project is the ease of obtaining information: what was once a long, inconsistent process of getting relevant information at an aggregate level is now streamlined into one central repository controlled by one team.


Data Warehousing: Different Cases — General Conclusions

Each data warehouse organization has unique methods and tools because business needs differ. In this sense, data warehousing can be compared to a mosaic or a children's construction set: you can build different figures from the same parts by arranging the elements differently. And if one part is lost or broken, you have to make a new one, or find another and "file it down with a rasp."

Generalities between different cases of data warehousing

There are several common themes and practices among successful data warehousing implementations, including:

• Successful data warehousing implementations start with clearly understanding the business objectives and how the warehouse (or data lake) can support those objectives.

• The data modeling process is critical to the success of data warehousing.

• The data warehouse is only as good as the data it contains.

• Successful data warehousing requires efficient data integration processes that can operate large volumes of data and ensure consistency and accuracy.

• Data warehousing needs ongoing performance tuning to optimize query performance.

• A critical factor in data warehousing is a user-friendly interface that makes it easy for end users to access the data and perform complex queries and analyses.

• Continuous improvement is essential to ensure the data warehouse remains relevant and valuable to the business.

Competent data warehousing implementations combine technical expertise and a deep understanding of business details and user needs.

Your case is not mentioned anywhere

When solving the problem of organizing data warehousing, one would like to find a description of an identical case and follow it step by step. But the probability of finding one is negligible: you will have to adapt to the specifics of the customer's business and weigh your knowledge and capabilities, as well as the technical and financial conditions of the project. Then you take pieces of the puzzle, or parts of the construction set, and build your own data warehouse. The minus: you have to do the work. The plus: it will be your own data storage decision and your own implementation.

Data Warehouse-as-a-Service Market Size Global Report, 2022 - 2030

Data Warehousing Is Like a Trampoline

Changes in data warehousing, like any technological and methodological changes, are made to improve data collection, storage, and analysis. They take the customer to a new level in their business, and the contractor to a new level in theirs. It is like a jumper and a trampoline: separately, one is just a gymnast and the other just equipment, but in combination they produce a third quality, the possibility of a sharp rise.

If you are facing the problem of organizing a new data warehousing system, or you are simply interested in what you have read, let's exchange views with DATAFOREST.

What is the benefit of data warehousing for business?

A data warehouse is a centralized repository that contains integrated data from various sources and systems. Data warehousing provides several benefits for businesses: improved decision-making, increased efficiency, better customer insights, operational efficiency, and competitive advantage.

What is the definition of a successful data warehousing implementation?

The specific definition of a successful data warehouse implementation will vary depending on the goals of the organization and the particular use case for data warehousing. Some common characteristics are: meeting business requirements, high data quality, scalability, user adoption, and positive ROI.

What are the general considerations for implementing data warehousing?

Implementing data warehousing involves some general considerations: business objectives, data sources, quality and modeling, technology selection, performance tuning, user adoption, ongoing maintenance, and support.

What are the most famous examples of the implementation of data warehousing?

There are many famous examples of the implementation of data warehousing across industries:

• Walmart has one of the largest data warehousing implementations in the world

• Amazon's data warehousing solution is known as Amazon Redshift

• Netflix uses a data warehouse to store and analyze data from its streaming platform

• Coca-Cola has a warehouse to consolidate data from business units and analyze it

• Bank of America analyzes customer data by data warehousing to improve customer experience

What are the challenges while implementing data warehousing, and how to overcome them?

Based on the experiences of organizations that have implemented data warehousing, some common challenges and solutions are:

• Ensuring the quality of the data being stored and analyzed. You must establish data quality standards and implement data validation and cleansing for each data type.

• Integrating data from disparate sources. Establishing a clear data integration strategy that accounts for the different data sources, formats, and protocols involved is vital.

• As the amount of data stored in a data warehouse grows, performance issues may arise. A brand should regularly monitor query performance and optimize the data warehouse to ensure that it remains efficient and effective.

• Keeping sensitive data stored in the warehouse secure. This involves implementing appropriate measures such as access controls, encryption, and regular security audits as part of privacy and security practice.

• Making significant changes to existing processes and workflows. This is solved by establishing a transparent change management process that involves decision-makers and users at all levels.

What is an example of how successful data warehousing has affected a business?

A good example of how successful data warehousing has affected Amazon is its recommendation engine, which suggests products to customers based on their browsing and purchasing history. By using artificial intelligence and machine learning algorithms to analyze customer data, Amazon has improved the accuracy of its recommendations, resulting in increased sales and customer satisfaction.

What role does data integration play in data warehousing?

Data integration is critical to data warehousing, enabling businesses to consolidate and standardize data from multiple sources, ensure data quality, and establish effective data governance practices.

How are data quality and governance tracked in data warehousing?

Data quality and governance are tracked in data warehousing through a combination of data profiling, monitoring, and management processes, along with data governance frameworks that define policies and procedures for managing them. This way, businesses can ensure their data is accurate, consistent, and compliant with regulations, enabling effective decision-making and business success.

Can the benefits of data warehousing be measured?

The benefits of data warehousing can be measured through improvements in data quality, efficiency, decision-making, revenue and profitability, and customer satisfaction. By tracking these metrics, businesses can assess the effectiveness of their data warehousing initiatives and make informed decisions about future investments in data management and analytics.

How to avoid blunders when warehousing data?

By following best practices, businesses can avoid common mistakes, minimize the risk of blunders when warehousing data, and ensure their data warehousing initiatives are successful and deliver data that is practical to analyze with business intelligence tools.


Aleksandr Sheremeta

Real-Time Data Warehouse Examples (Real World Applications)

Discover how businesses are leveraging real-time data warehouses to gain actionable insights, make informed decisions, and drive growth.


Gone are the days when organizations had to rely on stale, outdated data for their strategic planning and operational processes. Now,  real-time data warehouses process and analyze data as it is generated, helping overcome the limitations of their traditional counterparts. The impact of real-time data warehousing is far-reaching. From eCommerce businesses to healthcare providers, real-time data warehouse examples and applications span various sectors.

The significance of real-time data warehousing becomes even more evident when we consider the sheer volume of data being generated today. The global data sphere is projected to reach a staggering  180 zettabytes by 2025 . 

With these numbers, it’s no wonder every company is looking for solutions like real-time data warehousing for managing their data efficiently. However, getting the concept of a real-time data warehouse, particularly when compared with a traditional data warehouse, can be quite intimidating, even for the best of us. 

In this guide, with the help of a range of examples and real-life applications, we will explore how real-time data warehousing can help organizations across different sectors overcome the data overload challenge.

  • What Is A Real-Time Data Warehouse?


A  Real-Time Data Warehouse (RTDW) is a  modern tool for data processing that provides immediate access to the most recent data. RTDWs use real-time  data pipelines to transport and collate data from multiple data sources to one central hub, eliminating the need for batch processing or outdated information.

Despite similarities with traditional data warehouses, RTDWs are capable of  faster data ingestion and processing speeds . They can detect and rectify errors instantly before storing the data, providing consistent data for an effective decision-making process.
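The "detect and rectify errors before storing" behavior described above amounts to validating each event as it arrives, instead of waiting for a nightly batch. Below is a minimal sketch of that idea; the event schema (`ts`, `value`) and the dead-letter list are assumptions for illustration, not any specific RTDW product's API.

```python
# Sketch of validate-before-store in a streaming ingestion path.
# Bad events are diverted to a dead-letter queue for later repair,
# so the warehouse only ever holds consistent records.

def validate(event):
    """Return a list of problems with this event (empty means valid)."""
    errors = []
    if not isinstance(event.get("value"), (int, float)):
        errors.append("value must be numeric")
    if event.get("ts") is None:
        errors.append("missing timestamp")
    return errors

def ingest(stream, store, dead_letter):
    """Route each arriving event to the store or the dead-letter queue."""
    for event in stream:
        problems = validate(event)
        (dead_letter if problems else store).append(event)

store, dlq = [], []
ingest([{"ts": 1, "value": 3.5}, {"ts": None, "value": "bad"}], store, dlq)
print(len(store), len(dlq))
# 1 1
```

A production pipeline would do this inside a stream processor with schema registries and retries, but the per-event decision point is the same.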

Real-Time Data Warehouse Vs Traditional Data Warehouse

Traditional data warehouses act as storage centers for  accumulating an organization’s historical data from diverse sources. They combine this varied data into a unified view and provide comprehensive insights into the past activities of the organization. However, these  insights are often outdated by the time they are put to use , as the data could be days, weeks, or even months old.

On the other hand, real-time data warehousing brings a significant enhancement to this model by  continuously updating the data they house. This dynamic process provides a current snapshot of the organization’s activities at any given time, enabling immediate analysis and action. 

Let’s look at some of the major differences between the two.

Complexity & Cost

RTDWs are  more complex and costly to implement and maintain than traditional data warehouses. This is because they require more advanced technology and infrastructure to handle real-time data processing.

Decision-Making Relevance

Traditional data warehouses predominantly assist in long-term strategic planning. However, the real-time data updates in RTDWs make them  suitable for both immediate, tactical decisions and long-term strategic planning.

Correlation To Business Results

Because of fresher data availability, RTDWs make it easier to  connect data-driven insights with real business results and provide immediate feedback.

Operational Requirements

RTDWs demand constant data updates, a process that must run without causing downtime in warehouse operations. Traditional warehouses typically don't need this capability, since their data arrives in scheduled batches, but it becomes crucial when updates flow in continuously.

Data Update Frequency

While the lines between traditional data warehouses and real-time data warehouses are now blurred due to some data warehouses adopting streaming methods to load data, traditionally, the former updated their data in batches on a daily, weekly, or monthly schedule. As a result, the data some of these data warehouses hold may not reflect the most recent state of the business. In contrast, real-time data warehouses  update their data almost immediately as new data arrives.

3 Major Types Of Data Warehouses

Let's take a closer look at different types of data warehouses and explore how they integrate real-time capabilities.

Enterprise Data Warehouse (EDW)


An Enterprise Data Warehouse (EDW) is a  centralized repository that stores and manages large volumes of structured and sometimes unstructured data  from various sources within an organization. It serves as a comprehensive and unified data source for business intelligence, analytics, and reporting purposes. The EDW consolidates data from multiple operational systems and transforms it into a consistent and standardized format.

The EDW is designed to  handle and scale with large volumes of data . As the organization's data grows over time, the EDW can accommodate the increasing storage requirements and processing capabilities. It also acts as a  hub for integrating data from diverse sources across the organization . It gathers information from operational systems, data warehouses, external sources, cloud-based platforms, and more.

Operational Data Store (ODS)


An Operational Data Store (ODS) is designed to  support operational processes and provide real-time or near-real-time access to current and frequently changing data. The primary purpose of an ODS is to facilitate operational reporting, data integration, and data consistency across different systems. 

ODS collects data from various sources, like transactional databases and external feeds, and  consolidates it in a more user-friendly and business-oriented format.  It typically stores detailed and granular data that reflects the most current state of the operational environment. 

Data Mart

A  Data Mart is a specialized version of a data warehouse that is  designed to meet the specific analytical and reporting needs of a particular business unit , like sales, marketing, finance, or human resources.

Data Marts provide a more targeted and simplified view of data. It contains a  subset of data that is relevant to the specific business area , organized in a way that facilitates easy access and analysis.

Data Marts are  created by extracting, transforming, and loading (ETL) data from the data warehouse or other data sources and structuring it to support analytical needs. They can include pre-calculated metrics, aggregated data, and specific dimensions or attributes that are relevant to the subject area.
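The ETL step that carves a data mart out of the warehouse can be sketched as filtering to the mart's subject area and pre-aggregating a metric. The `dept`/`region`/`amount` schema below is invented for the example.

```python
# Sketch of building a sales data mart from warehouse rows: keep only the
# relevant subject area and pre-aggregate revenue by region.

warehouse_rows = [
    {"dept": "sales", "region": "EU", "amount": 100.0},
    {"dept": "sales", "region": "EU", "amount": 50.0},
    {"dept": "hr",    "region": "EU", "amount": 999.0},  # outside the mart's scope
]

def build_sales_mart(rows):
    """Aggregate sales revenue per region, ignoring other departments."""
    mart = {}
    for r in rows:
        if r["dept"] != "sales":
            continue
        mart[r["region"]] = mart.get(r["region"], 0.0) + r["amount"]
    return mart

print(build_sales_mart(warehouse_rows))
# {'EU': 150.0}
```

Real marts are usually materialized as tables refreshed by scheduled ETL jobs, but the filter-then-aggregate shape is the essence.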

  • 11 Applications Of Real-Time Data Warehouses Across Different Sectors 

The use of RTDWs is now common across many sectors. The rapid access to information they provide significantly improves the operations of many businesses, from online retail to healthcare.

Let’s take a look at some major sectors that benefit from these warehouses for getting up-to-the-minute data.

eCommerce

In the dynamic eCommerce industry, RTDWs facilitate immediate data processing that is used to get insights into customer behavior, purchase patterns, and website interactions. This enables marketers to deliver personalized content, targeted product recommendations, and swift customer service. Additionally, real-time inventory updates help maintain optimal stock levels, minimizing overstock or stock-out scenarios.

AI & Machine Learning

RTDWs empower AI/ML algorithms with new, up-to-date data. This ensures models make predictions and decisions based on the most current state of affairs. For instance, in automated trading systems, real-time data is critical for making split-second buying and selling decisions.

Manufacturing & Supply Chain

RTDWs support advanced manufacturing processes such as  real-time inventory management, quality control, and predictive maintenance . It provides crucial support for business intelligence operations. You can make swift adjustments in production schedules based on instantaneous demand and supply data to  optimize resource allocation and reduce downtime.

Healthcare

RTDWs in healthcare provide instant access to patient records, laboratory results, and treatment plans, improving care coordination. They also support real-time monitoring of patient vitals and provide immediate responses to critical changes in patient conditions.

Banking & Finance 

In banking and finance, RTDWs give you the  latest updates on customer transactions, market fluctuations, and risk factors . This real-time financial data analysis helps with immediate fraud detection, instantaneous credit decisions, and real-time risk management.

Financial Auditing

RTDWs enable continuous auditing and monitoring to give auditors  real-time visibility into financial transactions . It helps identify discrepancies and anomalies immediately to enhance the accuracy of audits and financial reports.

Emergency Services

RTDWs can keep track of critical data like the  location of incidents, available resources, and emergency personnel status . This ensures an efficient deployment of resources and faster response times, potentially saving lives in critical situations.

Telecommunications

RTDWs play a vital role in enabling efficient network management and enhancing overall customer satisfaction. They provide immediate analysis of network performance, customer usage patterns, and potential system issues. This improves service quality, optimizes resource utilization, and enables proactive problem resolution.

Online Gaming

RTDWs provide  analytics on player behaviors, game performance, and in-game purchases  to support online gaming platforms. This enables game developers to promptly adjust game dynamics, improve player engagement, and optimize revenue generation.

Energy Management

In the energy sector, RTDWs provide  instantaneous data on energy consumption, grid performance, and outage situations. This enables efficient energy distribution, quick response to power outages, and optimized load balancing.

Cybersecurity

RTDWs are crucial for cybersecurity as they provide  real-time monitoring of network activities and immediate detection of security threats. This supports swift countermeasures, minimizes damage, and enhances the overall security posture.

  • Real-Time Data Warehouse: 3 Real-Life Examples For Enhanced Business Analytics

To truly highlight the importance of real-time data warehouses, let’s discuss some real-life case studies.

Case Study 1: Beyerdynamic 

Beyerdynamic, an audio product manufacturer from Germany, was facing difficulties with its previous method of analyzing sales data: staff manually extracted data from legacy systems into a spreadsheet and then compiled reports. The process was time-consuming and often produced inaccurate reports.

To overcome these challenges, Beyerdynamic developed a  data warehouse that automatically extracted transactions from its existing ERP and financial accounting systems. This data warehouse was carefully designed to store standard information for each transaction, like product codes, country codes, customers, and regions. 

They also implemented a web-based reporting solution that helped managers create their standard and ad-hoc reports based on the data held in the warehouse.

Supported by an optimized data model, the new system allowed the company to perform detailed sales data analyses and identify trends in different products or markets.

  • Production plans could be adjusted quickly based on changing demand , ensuring the company neither produced excessive inventory nor missed out on opportunities to capitalize on increased demand.
  • With the new system, the company could use  real-time data for performance measurement and appraisal . Managers compared actual sales with targets by region, assessed the success of promotions, and quickly responded to any adverse variances.
  • Sales and distribution strategies could be quickly adapted according to changing demands in the market. For instance, when gaming headphone sales started increasing in Japan, the company promptly responded with tailored promotions and advertising campaigns.

Case Study 2: Continental Airlines 

Continental Airlines is a major player in the aviation world. It  faced significant issues because of old, manual systems. Their outdated approach slowed down decision-making and blocked easy access to useful data from departments like customer service, flight operations, and financials. Also, the lack of real-time data meant that decisions were often based on outdated information.

They devised a robust plan that hinged on two key changes: the ‘Go Forward’ strategy and a real-time data warehouse.

  • Go Forward Strategy:  This initiative focused on tailoring the airline’s services according to the customer’s preferences. The concept was simple but powerful –  understand what the customer wants and adapt services to fit that mold . In an industry where customer loyalty can swing on a single flight experience, this strategy aims to ensure satisfaction and foster brand loyalty.
  • Real-Time Data Warehouse:  In tandem with the new strategy, Continental also implemented an RTDW. This technological upgrade gave the airline quick access to current and historical data. The ability to extract insights from this data served as a vital reference point for strategic decision-making, optimizing operations, and enhancing customer experiences.

The new strategy and technology led to critical improvements:

  • The airline could offer a personalized touch by understanding and acting on customer preferences. This  raised customer satisfaction and made the airline a preferred choice for many.
  • The introduction of the RTDW brought simplicity and efficiency to the company’s operations. It facilitated quicker access to valuable data which was instrumental in  reducing the time spent on managing various systems. This, in turn, resulted in significant cost savings and increased profitability.

Case Study 3: D Steel 

D Steel, a prominent steel production company, was facing a unique set of challenges when they aimed to  set up a real-time data warehouse to analyze their operations. While they tried to use their existing streams package for synchronization operations, several obstacles emerged.

The system was near real-time but couldn't achieve full real-time functionality. The load on the source server was high, and synchronization tasks required manual intervention.

Moreover, it lacked automation for Data Definition Language (DDL) operations and compatibility with newer technologies, and it had difficulties with data consistency verification, recovery, and maintenance. These challenges pushed the steel company to seek a new solution.

The Solution

D Steel decided to implement real-time data warehouse solutions that enabled instant data access and analysis. 

The new RTDW system proved extremely successful, resolving all of the previous problems. It provided:

  • Real-time synchronization
  • DDL automation
  • Automated synchronization tasks
  • Reduced load on the source server

The system also introduced a function that compared current-year data with the previous year's, supporting the company's annual comparison analysis.
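The year-over-year comparison D Steel relied on boils down to a simple aggregation. A minimal sketch, with hypothetical production figures standing in for the company's real data:

```python
from collections import defaultdict

# Hypothetical daily production records: (ISO date, tons produced).
records = [
    ("2022-03-01", 480), ("2022-03-02", 510),
    ("2023-03-01", 530), ("2023-03-02", 505),
]

# Aggregate total output per year.
totals = defaultdict(int)
for date_iso, tons in records:
    totals[int(date_iso[:4])] += tons

# Year-over-year change between the current and previous year.
current, previous = totals[2023], totals[2022]
yoy_pct = (current - previous) / previous * 100
print(f"{previous} -> {current} tons ({yoy_pct:+.1f}% YoY)")
```

In the real system, this aggregation would run continuously against synchronized warehouse tables rather than an in-memory list.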

  • Enhancing Real-Time Data Warehousing: The Role of Estuary Flow


Estuary Flow is our data operations platform that connects various systems through a central data pipeline. With Flow, diverse storage and analysis systems, such as databases and data warehouses, stay in sync: Flow ensures that new data feeds into them continuously.

Flow uses a real-time data lake as an integral part of its data pipeline, and this lake serves two roles.

First, it works as a transit route for data and facilitates an easy flow and swift redirection to distinct storage endpoints. This feature also helps in backfilling data from these storage points.

The second role of the data lake in Flow is to serve as a reliable storage backbone, one you can lean on without fear of it turning into a chaotic ‘data swamp.’

Flow assures automatic organization and management of the  data lake . As data collections move through the pipeline, Flow applies different schemas to them as per the need.

Remember that the data lake in Flow doesn’t replace your ultimate storage solution. Instead, it aims to  synchronize and enhance other storage systems crucial for powering key workflows , whether they're analytical or transactional.

As we have seen with real-time data warehouse examples, this solution transcends industry boundaries. Only those organizations that embrace real-time data warehousing to its fullest can unlock the true potential of their data assets. 

While it can be a little tough to implement, the benefits of real-time data warehousing far outweigh the initial complexities, and the long-term advantages it offers are indispensable in today's data-driven world.

If you’re considering setting up a real-time data warehouse, investing in a top-notch real-time data ingestion pipeline like Estuary Flow should be your first step. Designed specifically for real-time data management, Flow provides a no-code solution to synchronize your many data sources and integrate fresh data seamlessly. Sign up for Estuary Flow for free and seize the opportunity today.


Enterprise Storage Forum content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More .

A data warehouse is a data management system used primarily for business intelligence (BI) and analytics. Data warehouses store large amounts of historical data from a wide range of sources and make it available for queries and analysis. These systems are capable of storing large amounts of unstructured data, unlike traditional relational databases, making them ideal for big data projects and real-time data processing. The value of the data in a warehouse grows over time, as the historical record of customer, product, and business process metrics can be analyzed to identify trends and behaviors.

This article looks at 10 common enterprise use cases for data warehouses.

Data Warehouses for Tactical Reporting

Data warehouses are great for storing data for reporting purposes. Because they’re optimized for high-performance queries, they’re perfect for ad-hoc or on-demand operations and performance reporting. Data warehouses are often used to consolidate data from multiple source systems, providing a holistic, global view of how particular factors are interacting with different areas.

Because of their speed and built-in performance optimization, they’re ideal for grabbing information on the go or for urgent matters. They provide answers almost instantly instead of making you wait for hours or days to generate reports the traditional way. The reports are also more accurate, as they include information from across the organization rather than piecemeal, which can lead to silos or outdated information.

Data Warehouses for Big Data Integration

It’s estimated that about 80 percent of data generated by enterprises is unstructured—think emails, PDF documents, social media posts, and multimedia files. Unstructured data is notoriously difficult to house and use effectively. Most solutions are not comprehensive enough to integrate all of an organization’s unstructured data sources, which means missed insights or subpar results compared to what an enterprise-grade data warehouse can achieve.

Using a data warehouse, the flow of data is more trustworthy because it has been verified through repeated on-demand queries by multiple parties. A warehouse also lets you automate big data analysis, giving analysts more time to focus on deep dives into specific problems rather than trying to wrangle disparate tools and solutions together. By gathering both structured and unstructured data from multiple sources across your organization and storing it in a data warehouse, you can create a more holistic view of your business’s data for processing and analysis.

Data Warehouses for Natural Language Processing (NLP)

Many organizations are looking to improve customer service through natural language processing (NLP), which allows for quick analysis and provides opportunities for growth in the support, sales, and marketing departments.

A data warehouse can store the massive amounts of structured and unstructured data submitted by customers and clients, which can then be analyzed using NLP models. Adequate analysis of this data leads to a real-time response by organization employees or bots, such as live chat assistance or responses based on past interactions with customers.

This kind of data mining is difficult without a stable data storage system like a data warehouse. It’s important to collect all information about your customers, including email, telephone calls, and social media posts, so it can be properly categorized and filed according to the products or services they use most often. This is essential for constructing a profile of each client, a unique digital identity in which all related information is stored in one place.
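The profile-building step described above can be sketched as a simple grouping of interactions by customer and channel; the identifiers and channels below are hypothetical:

```python
from collections import defaultdict

# Hypothetical interactions pulled from warehouse tables: (customer_id, channel, text).
interactions = [
    ("c1", "email", "Question about invoice"),
    ("c1", "phone", "Renewal call"),
    ("c2", "social", "Mentioned product on Twitter"),
    ("c1", "email", "Follow-up on invoice"),
]

# Build one profile per customer, grouping interactions by channel.
profiles = defaultdict(lambda: defaultdict(list))
for customer_id, channel, text in interactions:
    profiles[customer_id][channel].append(text)

# A single profile now holds every touchpoint for that customer.
print(dict(profiles["c1"]))
```

A real pipeline would feed such profiles from warehouse queries and pass the text fields to NLP models for categorization.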

Data Warehouses for Auditing and Compliance

Auditing and compliance checks are both labor-intensive tasks. Auditors need to look over spreadsheets of data, while compliance officers need to read through legal documents—tedious exercises that make keeping up with regulator demands difficult.

Data warehouses store electronic copies of important documents, saving time and money and reducing the rate of error and enabling more accurate analysis of the results. A good data warehouse will also have a structured storage format, so all relevant records can be retrieved instantly. This makes auditing faster and easier, while also making compliance easier because companies can quickly prove they’re in line with current regulations.


Data Warehouses for Data-Mining Analytics

Companies like Netflix base many business decisions on data-mining analytics, including which content is most popular, what promotional strategies work best, and which marketing campaigns resonate with subscribers. The data-mining analytics process stores massive amounts of data in a centralized location for easy analysis. Data warehouses are well-suited to data mining analytics, as they can store and make available the data necessary for insights as well as intellectual property and competitive intelligence.

Data Warehouses to Address Data Quality Issues

It’s important to promptly address errors and missed updates to avoid corrupt data and isolated silos, which can cause accuracy problems in analytics. One of data warehousing’s biggest benefits is that it enables business intelligence teams to act on errors in their databases.

Instead of manually correcting each error as it pops up, these tasks can be automated using extract, transform, load (ETL) tools like Informatica or Talend. For example, you could use SQL Server Integration Services (SSIS) to compare customer records with shipping records, and if a problem occurs—for instance, if one person receives multiple shipments from different addresses—you could fix it by adjusting an existing master record or creating a new one.

Data warehousing makes such fixes possible because it lets companies track and update data in large volumes over time, so errors don’t pile up and go unnoticed. And once a data warehouse is set up, IT departments can add functionality with minimal effort—no need to reinvent data systems when regulations change or when new uses arise for their data. By taking advantage of built-in data management features when necessary, IT professionals also spend less time trying to patch together ad hoc solutions.
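A cross-check of the kind described above, comparing customer master records against shipping records, might look roughly like this; the record layout is invented for illustration:

```python
# Hypothetical master customer records and shipping records.
customers = {"c1": {"address": "12 Oak St"}, "c2": {"address": "9 Elm Ave"}}
shipments = [
    {"customer_id": "c1", "address": "12 Oak St"},
    {"customer_id": "c1", "address": "44 Pine Rd"},  # disagrees with the master record
    {"customer_id": "c2", "address": "9 Elm Ave"},
]

# Flag shipments whose address disagrees with the customer master record.
discrepancies = [
    s for s in shipments
    if s["address"] != customers[s["customer_id"]]["address"]
]
for s in discrepancies:
    print(f"Mismatch for {s['customer_id']}: {s['address']}")
```

In practice an ETL tool would run this comparison as a scheduled job and either update the master record or create a new one, as the text describes.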

Data Warehouses for RTDW Processing

Real-time data warehousing, or RTDW, refers to the instantaneous processing of all enterprise data for analysis as soon as it enters an organization’s information system. This effectively reduces or eliminates costly and time-consuming post-processing of long data backlogs. Here are the major benefits of RTDW that help enterprises derive better business results:

  • Instant decision-making support to line of business users and customer service personnel
  • More accurate predictions and forecasts
  • Better data governance and security with fewer updates and reconciliations
  • Improved data quality through real-time validation, quality assurance, and error checking—data is continuously cleaned, updated, and validated
  • Streamlined operations, which can help identify inefficiencies and improve process optimization
  • Reduced costs through predictive analytics and automated diagnostic reporting
  • Reduced manual processing errors through early detection and resolution
  • Increased operational efficiency with advanced high-speed data retrieval
  • Improved customer service and satisfaction through real-time responses to customer behavior and patterns 
  • Risk mitigation through faster issue-responses
  • Reduced capital expenditure through efficient resource usage
  • Augmented business agility and resiliency through reduced dependence on manual processing

Logistics and manufacturing are two industries where real-time data warehousing can have a big impact on operations. For example, a manufacturer may want to know about a faulty component as soon as it is installed to initiate a recall or initiate preventive measures, or logistics providers could analyze shipment data to better prepare for demand spikes and optimize routes.

Data Warehousing for Big Data Analysis

Organizations dealing with large volumes of data (internet-based businesses that process millions of credit card transactions every month, for example) need to manage all that information. Data warehouses are specifically designed to deal with massive amounts of data quickly and reliably, which makes them an essential tool for analysis purposes.

Traditional transaction-oriented systems such as row-based relational databases simply can’t cope with such quantities of data. They also lack analytics-oriented features such as columnar storage and parallel query execution, which significantly increases latency during both writing and reading operations at this scale.


Data Warehouses for Data-Driven Decision-Making

Data warehouse solutions make it possible to make critical business decisions based on new insights from your company’s historical data. You can then use your new knowledge to inform big-picture plans, such as where to focus marketing efforts or what products and services to develop.

For example, consider the University of St. Andrews, where student administrators relied heavily on data warehouse and reporting systems to generate insights into student data. Keeping data on more than 10,000 students created numerous problems with the school’s legacy systems, so the university implemented a hybrid architecture that allowed staff to analyze student data on demand while retaining flexibility for future upgrades and developments.

Data Warehouses for Business Intelligence

For true end-to-end system visibility, enterprises need a BI platform that can act as a hub for all of their structured and unstructured data. An online transaction processing (OLTP) database is great for storing transactional data at high volumes but is not optimized for business intelligence; depending on how the OLTP and BI systems are designed, they may not even integrate. Online analytical processing (OLAP) systems, by contrast, are optimized for fast data processing and analysis, enabling businesses to promptly and easily pull insights from large amounts of data and identify patterns and trends that inform business decisions.

An OLAP data warehouse will provide better access to important information in real time and help simplify complex data queries by consolidating critical data in one place. If you already have an operational data store in place but want to go further with your big data strategy, then building out a scalable business intelligence platform is key to moving forward with information discovery efforts across an enterprise.

Bottom Line: Data Warehouses for Enterprises

Data warehousing allows businesses to understand past data performance to develop effective plans and provides historical information they can refer back to later when making important business decisions. Data warehouses are designed to store massive amounts of structured and unstructured data for analysis and business intelligence, providing a holistic and historical record and serving as the enterprise’s “single source of truth.” A successful data warehouse strategy helps businesses understand exactly where they stand today and set measurable benchmarks that can drive long-term growth.


Anina Ot


Introduction to data warehouses: use cases, design and more

Have you ever felt like you're drowning in data but starving for insights? As organizations collect more and more data from various systems and sources, making sense of it all to drive better business decisions is a key challenge. All those numbers, metrics, and transactions flying around your business hold keys that can unlock big-time insights, if only you could collect them in one place and organize them.

That's why smart companies build data warehouses – central repositories updated on a regular basis, collecting data scattered across departments and databases. With the right data foundation powering detailed analytics, powerful data science and business intelligence efforts, companies can shift from reactive responses to proactive, insight-led strategies that create competitive advantages and shareholder value.

In this article, we give you insights into building a data warehouse that transforms a digital deluge into strategic insights that propel business growth. The era of data warehousing isn't on the horizon; it's already here. And for businesses aiming to thrive in this new landscape, the message is clear: the time to embrace data warehousing is now.

What is a data warehouse?

What is data warehousing, and why is it critical in today’s businesses? Simply put, a data warehouse is a specialized database optimized for analytical queries rather than transaction processing. It serves as a central store for historical and current data from multiple sources, structured for querying and analysis. It is a treasure chest for business analysts, data engineers, data scientists, and decision-makers who rely on business intelligence tools, SQL clients, and analytics applications to extract meaningful insights.

Some key characteristics that distinguish a data warehouse:

  • Integrated data from multiple systems and sources – e.g. combining sales data from CRM systems, web analytics data, and inventory data from ERP systems,
  • Current and historical data – retains time-variant data over extended periods to enable trend analysis, for example, maintaining 5+ years of historical data even when source systems purge older data,
  • Read-only – data is loaded and stored but not updated or deleted within the warehouse, ensuring its reliability as a consistent analytic data source,
  • Organized by relevant subjects – and focused on business analysis rather than operations, such as sales, inventory or production analytics, or similar.

In short, a data warehouse powers business intelligence by enabling users to easily query large volumes of high quality, integrated data for reporting and advanced analytics.

What is a cloud data warehouse?

A cloud data warehouse is a data warehouse delivered as a fully managed service in the cloud, known for its scalability and flexibility. One of the leading options is AWS Amazon Redshift , which offers extensive data warehousing and analytical capabilities. 


Advantages of a cloud data warehouse vs an on-premise data warehouse

Migrating your on-premise data warehouse to the cloud offers several compelling benefits, including:

  • Elastic scalability – scale storage and compute up and down as needed, from a few terabytes to petabytes and back again, without any hardware capacity planning,
  • No infrastructure to setup and manage – fully managed service without need for database admin and ops teams to install, maintain and tune infrastructure,
  • Usage-based pricing – pay only for storage, compute and services used per hour or month rather than upfront hardware costs,
  • Faster time to value – get up and running quickly without lengthy on-prem hardware procurement cycles.

For example, with a cloud data warehouse like Amazon Redshift, ad-hoc analysis of billions of rows of data can be supported instantly with no upfront hardware investments required.

While on-premise data warehouses provide more control and security, they often struggle with complexity and scalability. For businesses requiring rapid scaling, cloud data warehouses can be more responsive to these needs.

How does a data warehouse work?

Behind the scenes, a data warehouse is composed of an automated data integration architecture for loading and organizing data, and a structured database optimized for analytical performance. Data within a warehouse is methodically organized into databases, tables, and columns, facilitating efficient data management. 

Data warehouse architecture

Cloud-native data warehouse platforms handle the infrastructure, management, and scaling of analytical data workloads in the cloud. Their architecture typically consists of:

Staging layer

The staging layer acts as a temporary landing zone where raw data gets extracted from source systems before loading it into the data warehouse. It handles essential data integration tasks like validation, quality checks, transformation, and data conversion to prepare the source data for analytic use.

For example, log files or database backups may get staged here for processing.

Core data warehouse

This analytical database is where clean, integrated enterprise data is stored in an analytics-optimized structure designed for flexibility, scalability, and fast query performance. Various storage schema designs can be used to structure the data, such as star or snowflake schemas that separate business facts from dimensional attributes. Columnar storage, compression, partitioning, caching, and query optimization help provide fast analysis over huge data volumes, even with petabytes of historical data.
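A minimal star schema of the kind described here can be sketched with SQLite; the fact and dimension tables are illustrative, not a prescribed design:

```python
import sqlite3

# In-memory database standing in for the core warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes of each product.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
# Fact table: one row per sale, keyed to the dimension.
cur.execute("CREATE TABLE fact_sales (product_id INTEGER, qty INTEGER, revenue REAL)")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Headphones", "Audio"), (2, "Webcam", "Video")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 3, 450.0), (1, 1, 150.0), (2, 2, 180.0)])

# A typical analytical query: revenue by category, joining fact to dimension.
rows = cur.execute("""
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales AS f JOIN dim_product AS d USING (product_id)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
print(rows)  # [('Audio', 600.0), ('Video', 180.0)]
```

A production warehouse adds columnar storage, partitioning, and many more dimensions (date, region, customer), but the fact-joined-to-dimensions query shape stays the same.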

Access layer

The presentation layer consists of BI tools, SQL clients, analytics dashboards, and other applications that analysts, data scientists, and business users leverage to access and analyze the data. This abstraction layer hides complex underlying data structures while exposing intuitive business entities. Security integration and access controls also happen here.

Together these cloud data warehouse layers allow scalable and governed pipelines to move raw data from on-prem systems into a cloud analytics-ready state where it fuels advanced analysis and business insights.

This architecture provides flexible and secure data warehouse management, and can be customized based on organizational requirements. From simple structures to complex models with special staging areas, these architectures allow businesses to adapt their data warehouses to specific operational needs.

How do databases, data warehouses, and data lakes work together?

When further customization is required, modern analytics architectures offer a powerful combination of transactional source systems, data lakes, and warehouses. Together, they form a comprehensive ecosystem for managing diverse data requirements:

  • Transactional databases. Databases usually handle real-time operational data. They are systems of record capturing business events and transactions (e.g. a retail bank's core banking system tracking deposits, payments, and account updates),
  • Data warehouses. Central, integrated repositories specialized for analytics (e.g. an aggregate view of retail banking transactions, balances, and customer details),
  • Data lakes. Scalable “landing pads” for vast amounts of raw data both structured and unstructured (e.g. clickstream events from a bank's website funneling into a Hadoop data lake), acting as an expansive repository storing data in its original format.

Together these systems enable real-time operations to feed raw data efficiently into long-term storage, where ETL processes (extract, transform, and load) structure and prepare data for analysis within the optimized data warehouse, where it unlocks value.

For example, a business may use a database for daily operations, a data lake for storing raw data from transactional systems, and a data warehouse for in-depth analysis of data reshaped, cleaned, and integrated in batch ETL processes.
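The lake-to-warehouse hop in this example reduces, at its core, to a batch transform step. A toy sketch, with an invented event format:

```python
# Hypothetical raw events as they landed in the data lake, in their original format.
lake = [
    {"ts": "2024-01-05T10:00:00", "amount": "19.99", "currency": "usd"},
    {"ts": "2024-01-05T11:30:00", "amount": "5.00", "currency": "USD"},
    {"ts": None, "amount": "3.50", "currency": "usd"},  # incomplete record, dropped
]

# Transform: drop invalid rows, normalize types and codes, reshape for analysis.
warehouse = [
    {"day": row["ts"][:10],
     "amount": float(row["amount"]),
     "currency": row["currency"].upper()}
    for row in lake
    if row["ts"] is not None
]

# Load: in a real pipeline these cleaned rows would be appended to a warehouse table.
print(len(warehouse), "rows loaded")
```

Real ETL jobs run this kind of validation and normalization at scale, on a schedule, with the cleaned output landing in the warehouse's fact tables.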

What are the benefits of a data warehouse?

Now that we understand what data warehousing entails and how it works, let’s highlight some of the tangible benefits:

  • Deeper business insights – by structuring integrated enterprise data focused specifically around business analysis, hidden insights can be uncovered to inform strategic decisions,

For example, analyzing multidimensional sales trends by region, product, sales rep can result in more granular understanding of a particular market.

  • Improved analytics performance – orders of magnitude faster query speeds compared to transaction databases, e.g. sub-second response for ad-hoc queries involving billions of rows, to enable interactive analysis without high latency impacting production systems.
  • Data governance and QA – data warehousing ETL processes improve data quality, consistency, accuracy, and governance enforced on source data before final storage,

Validating records, reconciling disparate coding, and flagging outliers or errors are standard procedures performed while using a data warehouse.

  • Historical trend analysis – retains extensive time series data beyond what even source systems store for longitudinal and historical analysis,

For instance, analyzing account growth and retention cohorts over a 5+ year period can translate to more informed customer engagement strategies, providing insights into customer behavior and helping to tailor marketing efforts for improved loyalty and retention.

  • Self-service BI and agility – broad user access and flexible tools empower various teams to gain insights from data independently without being gated by IT or engineering resources.

The ultimate end goal of data warehouse investment is empowering more data-driven decision making at an organizational level based on accurate analytics leveraging your full domain of enterprise data.

Data warehouse use cases

Nearly any business looking to unlock value through deeper analysis of its data is a candidate for data warehousing. Some examples where data warehousing delivers immense value include:

  • Sales analytics – companies can leverage data warehousing and business intelligence capabilities for deeper sales analytics to uncover trends, optimize operations, and improve forecasting. By integrating data from CRM systems, financial records, and other sources, sales leaders can track KPIs like pipeline trends, win/loss metrics, deal velocity, and forecast accuracy down to granular segment and account levels. This enables data-driven decisions around resource allocation, quota setting, and sales process optimization,
  • Customer analytics – data warehousing provides a 360-degree customer intelligence capability by consolidating data from across disparate channels and systems. Analytics use cases span behavioral segmentation for precision targeting, analyzing customer lifetime value trends, building predictive models to estimate renewal and churn risks, tracking multi-touch attribution, and optimizing customer marketing through campaign measurement and ROI analysis. These data-driven insights support strategic decision making around resource allocation, pricing optimization, product enhancement priorities and customer experience management,
  • Human resources – HR teams can leverage analytics use cases such as advanced workforce planning leveraging historical hiring and attrition models, analyzing training program efficacy, ensuring pay equity across segments, and building predictive models for talent retention risks. By understanding key workforce metrics and trends better, HR leaders are empowered to optimize hiring and talent management programs, minimize regrettable turnover in key roles, and align learning & development investments to drive productivity,
  • Financial planning – data warehousing enables finance teams to consolidate data from operations, sales, HR and other functions for deeper analysis in support of long term planning. Use cases include driver-based budget forecasting by business unit, dynamic what-if modeling and scenario analysis, analyzing true profitability by customer segment, product line or region, and optimizing cost structure through granular cost & profitability reporting. The insights help guide operational planning, investment decisions and growth strategy.

Having explored common business analytics use cases powered by data warehousing, let's shift to some leading practices for architecting your own data warehouse solution tailored to your organization's needs. While specific components depend heavily on your industry, data systems landscape and expected query patterns, these general guidelines help structure tactical design decisions.

How to design a data warehouse for your specific use case, step-by-step

Key considerations when architecting a data warehouse for a specific use case include:

Step 1: Integrate key data sources

A fundamental step is identifying critical systems of record across your organization to integrate into the data warehouse, such as your CRM and ERP databases, web or mobile analytics tools, IoT data streams, and other transactional and operational systems.

For instance, AWS offers native connectivity from Redshift to common data sources like S3, DynamoDB, RDS, and Salesforce. Ingesting this raw data becomes the foundation for transformed datasets serving various analytics use cases.
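As a rough illustration of this ingest-and-transform step, here is a minimal Python sketch (the source systems, field names, and records are all invented for the example) that normalizes rows from two hypothetical systems of record into one warehouse-ready schema:

```python
# Minimal ETL sketch: pull records from two hypothetical sources,
# normalize them to one schema, and "load" them into a warehouse table (a list).

crm_rows = [  # e.g. exported from a CRM
    {"CustomerName": "Acme Corp", "Signup": "2023-01-15"},
]
erp_rows = [  # e.g. exported from an ERP, with a different date format
    {"customer": "Globex", "signup_date": "15/02/2023"},
]

def transform_crm(row):
    return {"customer": row["CustomerName"], "signup_date": row["Signup"]}

def transform_erp(row):
    d, m, y = row["signup_date"].split("/")          # DD/MM/YYYY -> ISO
    return {"customer": row["customer"], "signup_date": f"{y}-{m}-{d}"}

warehouse = [transform_crm(r) for r in crm_rows] + [transform_erp(r) for r in erp_rows]
```

In a real pipeline the transforms would run inside an ETL tool or SQL, but the shape of the work is the same: map each source's schema onto the warehouse's standard one.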

Step 2: Enable self-service access

Structure curated data sets tailored for business teams containing clean, business-friendly views they can self-serve for analysis without deep technical skills.

For example, Redshift Spectrum allows querying exabytes of unstructured S3 data, while Redshift functions help present calculated metrics. Combine disparate data into reusable semantic models using AWS services like QuickSight and SageMaker, or third-party BI tools.

Step 3: Ensure scalability

Cloud data warehouses like Redshift deliver petabyte-scale storage capacity, easily scaling on demand. Their distributed query processing architecture optimizes performance across compute clusters spanning dozens of nodes. Moreover, you pay only for the managed infrastructure used per hour, with no upfront costs.

Step 4: Combine batch & streaming

Modern data warehouses can integrate both batch and real-time streaming data sources. Batch data pulled from systems of record on periodic extracts provides in-depth historic context. Integrating streaming data from message buses, IoT devices or transactional logs adds a real-time dimension enabling up-to-the-minute operational insights. Your warehouse architecture should support ingesting high-volume event streams as well as retaining large historic datasets, under one analytics foundation.

For example, Redshift integrates batch historic data from S3 and databases with streaming sources like Kafka and Kinesis using services like Firehose, enabling real-time dashboards over fresh data while retaining large historical datasets for flexible ad-hoc analysis.
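Conceptually, combining batch and streaming sources amounts to landing both in one ordered event table. A toy Python sketch (timestamps, order IDs, and amounts are invented) of that merge:

```python
# Sketch: merge a periodic batch extract with fresh streaming events into
# one chronologically ordered event table for real-time dashboards.

batch_events = [  # nightly extract from a system of record
    {"ts": "2024-05-01T00:10", "order_id": 1, "amount": 120.0},
    {"ts": "2024-05-01T08:30", "order_id": 2, "amount": 75.5},
]
stream_events = [  # arriving from a message bus during the day
    {"ts": "2024-05-01T09:05", "order_id": 3, "amount": 42.0},
]

# ISO-8601 timestamps sort correctly as strings
combined = sorted(batch_events + stream_events, key=lambda e: e["ts"])
latest_total = sum(e["amount"] for e in combined)
```

A production warehouse does this continuously via ingestion services (e.g. Firehose in the Redshift example above) rather than in application code, but the resulting table serves both fresh dashboards and ad-hoc history queries.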

Step 5: Facilitate data science

Plan to enable more advanced data manipulation, statistical analysis and machine learning model development against warehouse data. Provide tools and access for data scientists to transform raw data into features, build training datasets, even build and execute machine learning models with native scoring integration. Optimized data types for complex analytics functions processing large data volumes with high cardinality can enable sophisticated enterprise AI/ML without requiring data to move elsewhere.

If you decide to use Redshift, it is good to know that it integrates smoothly with Sagemaker allowing data scientists to manipulate datasets for advanced analytics at scale. ML models trained in Sagemaker can also be operationalized within Redshift SQL queries. Optimized types like hyperloglog algorithms enable complex analytics functions over extremely large data.

The key is flexible data warehousing architectures to facilitate varied analytics use cases – from parameterized dashboards to canned reports and ad-hoc analysis to advanced modeling. Different query patterns should be supported for business analysts, data engineers, data scientists and application developers against a consolidated 360-degree data view.

How can RST Software support your data warehouse efforts

As data management experts supporting clients across industries in leveraging AWS, RST Software is well equipped to advise on, architect, and implement your cloud data warehousing initiatives. Our team can help at any point along your data journey:

  • Advisory on leading data warehousing approaches and best practices,
  • Technical architecture and design services,
  • Build and deployment services on AWS cloud platforms.

Whether you are looking to set up your first cloud data warehouse or migrate from legacy data warehouses, reach out to us and let's discuss how RST can help to transform your enterprise data into business value.


10 Benefits and Use Cases of a Data Warehouse

A recent  IDC DataSphere  forecast report predicts that the compound annual growth rate of  global data creation and replication  will reach 23% between 2020 and 2025.

Another study suggests global data creation will grow to over  180 zettabytes  during that same period.

Cheaper data storage and advanced analytics technologies are contributing to the current data explosion. But aggregating that data into a single place where you can easily analyze it remains a complex task.

With data trapped in isolated systems across an organization, teams struggle to access accurate, consistent data from the multiple analytics and  ETL tools  being used.

Fortunately, organizations can  use a data warehouse  to collect, organize, and analyze data on demand.

The role of data warehousing

Data warehousing consolidates large amounts of data from multiple sources and optimizes it to enable analysis for improving business efficiency, making better decisions, and discovering competitive advantages.

Note that a data warehouse is  not the same  as a database.

While both are relational data systems, a database uses  online transaction processing (OLTP) to store current transactions and enables fast access to specific transactions for ongoing business processes.

On the other hand, data warehouses store large quantities of historical data and support fast, complex queries across all data using online analytical processing (OLAP) .

This article will examine the  benefits of a data warehouse  and offer  use cases where such a system could add value to your business.

Data warehouse benefits

A successfully implemented data warehouse can help your organization in several ways. Some of the benefits of a data warehouse include:

1. Consistency

Data warehousing typically involves converting data from multiple sources and formats into one standard format, making it easier for users to analyze and share insights on the entire collection of data.

More consistent data means that individual business departments such as marketing, sales, and finance can use the same data resource for queries and reports to produce results consistent with the other departments.

2. Centrality

Most organizations need to merge data from multiple subsystems built on different platforms to perform valuable business intelligence. Data warehousing solves this problem by consolidating data into a single repository, making all the organization’s data available from a centralized location.

3. Accessibility

Data warehousing improves end-user access to a wide range of enterprise data.

In many cases, business users and decision-makers have to log into every individual department system and manually consolidate data or request reports through IT personnel to get the data they need. Using a data warehouse, business users can generate reports and queries on their own.

Users can access all the organization’s data from one interface instead of having to log into multiple systems. Easier access to data means less time spent on data retrieval and more time on data analysis.

4. Auditability

The goal of a data warehouse is to ensure that data is accurate, current, and accessible—which is also the goal of the auditing process.

The use of a data warehouse can ensure data integrity through implemented controls for roles and responsibilities related to extracting data from source systems and migrating to the data warehouse.

Security controls implemented within the data warehouse ensure that users only have read access to data.

5. Data sanitization

When data gets integrated from multiple systems, it can become inconsistent because of incomplete, duplicated, or redundant information. If the data is not cleansed or corrected, these errors could reflect in queries and reports, leading to inaccurate insights.

Data warehouses use a sanitization process to eliminate poor-quality information from the data repository. The method detects duplicate, corrupt, or inaccurate data sets, then replaces, modifies, or deletes records to ensure data integrity and consistency.
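A minimal Python sketch of such a sanitization pass (the records are invented): drop exact duplicates, discard rows missing required fields, and normalize inconsistent casing before anything reaches the warehouse.

```python
# Sanitization sketch: detect and remove duplicate, incomplete,
# or inconsistently formatted records to preserve data integrity.

raw = [
    {"email": "a@example.com", "country": "US"},
    {"email": "a@example.com", "country": "US"},   # exact duplicate
    {"email": None,            "country": "DE"},   # incomplete record
    {"email": "b@example.com", "country": "us"},   # inconsistent casing
]

seen, clean = set(), []
for rec in raw:
    if not rec["email"]:                 # required field missing -> drop
        continue
    rec = {"email": rec["email"].lower(),
           "country": rec["country"].upper()}      # normalize
    key = (rec["email"], rec["country"])
    if key not in seen:                  # deduplicate
        seen.add(key)
        clean.append(rec)
```

Real warehouses apply these rules declaratively in ETL tooling, but the checks (completeness, normalization, deduplication) are the same.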

Use cases for a data warehouse

The following use cases demonstrate how you can use a data warehouse in your organization.

1. Marketing/sales campaign effectiveness

Marketing data can get scattered across multiple systems in an organization, including customer relationship management systems and sales systems. By the time teams pull together scattered data into spreadsheets to calculate important metrics, the data may have become outdated.

A marketing data warehouse creates a single source of data from which the marketing team can operate. In addition, you can merge data from systems within the organization and external systems such as web analytics platforms, advertising channels, and CRM platforms.

With a data warehouse, all marketers have access to the same standardized data , allowing them to execute faster, more efficient initiatives. Teams can generate more granular insights and better track performance metrics such as ROI, lead attribution, and customer acquisition costs.

Data warehouses can also process data in real-time, enabling marketers to build campaigns around the most recent data to generate more leads and business opportunities.
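Once channel data sits in one place, the performance metrics mentioned above reduce to simple arithmetic. A toy sketch with invented spend and revenue numbers:

```python
# Toy metric computation over merged campaign data:
# customer acquisition cost (CAC) and return on investment (ROI) per channel.

channels = [
    {"channel": "search", "spend": 5000.0, "new_customers": 50, "revenue": 12000.0},
    {"channel": "social", "spend": 2000.0, "new_customers": 10, "revenue": 1500.0},
]

for c in channels:
    c["cac"] = c["spend"] / c["new_customers"]            # cost per new customer
    c["roi"] = (c["revenue"] - c["spend"]) / c["spend"]   # net return per dollar
```

The value of the warehouse here is not the arithmetic but that spend, customer counts, and revenue come from different systems and only become comparable after consolidation.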


2. Team performance evaluations

Data warehouses can help evaluate team performance across the organization. Users can dig deeper into team data to create customized dashboards or reports, showing team performance based on specific metrics.

Metrics derived from the data warehouse, such as usage patterns, customer lifetime value, and acquisition sources, can be used to evaluate customer service, sales, and marketing teams, respectively.

In addition, combined data sets from other business areas can also highlight how teams have contributed to overall business performance and objectives.

3. IoT data integration

Internet of Things (IoT) devices, or network-connected devices like smartwatches, kitchen appliances, and security devices, generate vast amounts of data that you can analyze to improve systems and processes .

This data must be collected and stored in relational formats to support historical and real-time analysis. Then, instant queries are performed against millions of events or devices to discover real-time anomalies or predict events and trends from historical data.

IoT data analysis requires a high-performance, easy-to-access platform that’s flexible enough to respond immediately to changing conditions. This data can be summarized and filtered into fact tables with a data warehouse to create time-trended reports and other metrics.
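The summarize-and-filter step can be sketched in a few lines of Python (device IDs, hours, and readings are invented): raw sensor events are rolled up into hourly fact-table rows ready for time-trended reporting.

```python
# Sketch: summarize raw IoT sensor events into an hourly fact table.
from collections import defaultdict

events = [
    {"device": "therm-1", "hour": "2024-05-01T10", "temp": 21.0},
    {"device": "therm-1", "hour": "2024-05-01T10", "temp": 23.0},
    {"device": "therm-1", "hour": "2024-05-01T11", "temp": 25.0},
]

buckets = defaultdict(list)
for e in events:
    buckets[(e["device"], e["hour"])].append(e["temp"])

fact_table = [
    {"device": d, "hour": h, "avg_temp": sum(v) / len(v), "readings": len(v)}
    for (d, h), v in buckets.items()
]
```

At warehouse scale the same rollup runs as a GROUP BY over millions of events, but the fact-table shape (grain of device x hour, with aggregated measures) is identical.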


4. Merging data from legacy systems

Legacy data  is information stored in an old format or obsolete systems, making it difficult to access and process. Unfortunately, many businesses still rely on mainframe environments and other legacy application systems despite technological advancements in platforms, architectures, and tools.

One reason is that these systems have captured business knowledge and rules that are difficult to migrate to newer platforms and applications over the years. But the information within legacy systems can be a valuable data resource for analytical systems.

Legacy systems were built to perform specific functions and did not get constructed to analyze data. As a result, companies that run core functions on a mainframe or other legacy software don’t have timely access to core transactional data for real-time information.

Gaining access to data locked away within legacy systems can be pivotal to solving business problems and can help you discover trends you might not be able to see with newer data.

Data warehouses can automatically connect to legacy systems to collect and analyze data. Using ETL, data warehouses can transform data from legacy systems into a format that newer applications can use.

Merging legacy data with new applications can help provide greater insight into historical trends, leading to more accurate business decisions.


5. Analyzing large stream data

Large data streaming is a method that processes, you guessed it, large streams of real-time data to extract insights and useful trends. A continuous stream of unstructured data is analyzed before it gets stored to disk, and the value of the data can decrease if not processed immediately.

Processing occurs at high speeds across a cluster of servers in real-time; data cannot get reanalyzed once streamed.

Large stream data is continuously generated by multiple sources. The data can vary widely from a mobile device or web application log files to in-game player activity, social media information, and e-commerce purchases. Processed data gets used for several analytical purposes, such as aggregations, filtering, correlations, and sampling.

Data analysis performed on large stream data gives businesses insight into business and customer activities such as service usage, website clicks, device geolocation, and server activity.

A data warehouse can group large stream data to show its overall statistics. For example, a delivery company collects delivery event data that is sessionized to determine overall statistics for delivery times and the distance traveled.
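The delivery-company example above can be sketched in Python (packages and timestamps are invented): events are sessionized by package, and per-package delivery durations roll up into an overall statistic.

```python
# Sketch: sessionize a stream of delivery events by package
# to derive per-delivery duration and an overall average.
from datetime import datetime

events = [
    {"package": "P1", "event": "picked_up", "ts": "2024-05-01 09:00"},
    {"package": "P1", "event": "delivered", "ts": "2024-05-01 11:30"},
    {"package": "P2", "event": "picked_up", "ts": "2024-05-01 10:00"},
    {"package": "P2", "event": "delivered", "ts": "2024-05-01 10:45"},
]

sessions = {}
for e in events:
    ts = datetime.strptime(e["ts"], "%Y-%m-%d %H:%M")
    sessions.setdefault(e["package"], {})[e["event"]] = ts

durations_min = {
    p: (s["delivered"] - s["picked_up"]).total_seconds() / 60
    for p, s in sessions.items()
}
avg_delivery_min = sum(durations_min.values()) / len(durations_min)
```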

The many benefits of using a data warehouse are evident in the above use cases, including:

  • Streamlined information flow
  • Enhanced data quality and consistency
  • Improved business intelligence
  • Significant competitive advantage
  • Improved decision making

Organizations that capture the full benefits of data are better equipped to handle changing market conditions and evolving customer requirements. As a result, data warehousing can offer great value to businesses to centralize and create more consistent data that’s easier for business users to access.

And as you’ve seen, data warehouses can be beneficial in several business scenarios, including marketing campaigns, IoT data integrations, and analyzing large stream data.

If you have complicated data requirements, a data warehouse can make things easier. With next-generation data warehousing tools like  Panoply , you can  connect all your data to a central data warehouse, reducing the time needed to get the most out of your data.


A data warehouse, or enterprise data warehouse (EDW), is a system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI) and machine learning .

A data warehouse system enables an organization to run powerful analytics on large amounts of data (petabytes and petabytes) in ways that a standard database cannot.

Data warehousing systems have been a part of business intelligence (BI) solutions for over three decades, but they have evolved recently with the emergence of new data types and data hosting methods. Traditionally, a data warehouse was hosted on-premises—often on a mainframe computer—and its functionality was focused on extracting data from other sources, cleansing and preparing the data, and loading and maintaining the data in a relational database. More recently, a data warehouse might be hosted on a dedicated appliance or in the cloud, and most data warehouses have added analytics capabilities and data visualization and presentation tools.


Generally speaking, data warehouses have a three-tier architecture, which consists of a:  

Bottom tier :  The bottom tier consists of a data warehouse server, usually a relational database system, which collects, cleanses, and transforms data from multiple data sources through a process known as Extract, Transform, and Load (ETL) or a process known as Extract, Load, and Transform (ELT). For most organizations that use ETL, the process relies on automation, and is efficient, well-defined, continuous and batch-driven.  

Middle tier :  The middle tier consists of an OLAP (online analytical processing) server which enables fast query speeds. Three types of OLAP models can be used in this tier, which are known as ROLAP, MOLAP and HOLAP. The type of OLAP model used is dependent on the type of database system that exists.  

Top tier :  The top tier is represented by some kind of front-end user interface or reporting tool, which enables end users to conduct ad-hoc data analysis on their business data.

Most data warehouses will be built around a relational database system, either on-premises or in the cloud, where data is both stored and processed. Other components would include a metadata management system and an API connectivity layer enabling the warehouse to pull data from organizational sources and provide access to analytics and visualization tools.

A typical data warehouse has four main components: a central database, ETL tools, metadata and access tools. All of these components are engineered for speed so that you can get results quickly and analyze data on the fly.

The data warehouse has been around for decades. Born in the 1980s, it addressed the need to optimize analytics on data. As companies’ business applications began to grow and generate/store more data, they needed data warehouse systems that could both manage the data and analyze it. At a high level, database admins could pull data from their operational systems and add a schema to it via transformation before loading it into their data warehouse.

As data warehouse architecture evolved and grew in popularity, more people within a company started using it to access data–and the data warehouse made it easy to do so with structured data. This is where metadata became important. Reporting and dashboarding became a key use case, and SQL (structured query language) became the de facto way of interacting with that data.

Let's take a closer look at each component.

ETL (extract, transform, load) is the process database analysts use to move data from a data source into the data warehouse. In short, ETL converts data into a usable format so that once it's in the data warehouse, it can be analyzed and queried.

Metadata is data about data. Basically, it describes all of the data that's stored in a system to make it searchable. Some examples of metadata include the author, date, or location of an article, the creation date of a file, the size of a file, and so on. Think of it like the column titles in a spreadsheet. Metadata allows you to organize your data to make it usable, so you can analyze it to create dashboards and reports.
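A tiny Python sketch of how metadata makes warehouse data searchable (the catalog entries and field names are invented): each entry describes a stored dataset, so users can find tables without scanning the data itself.

```python
# Sketch of a minimal metadata catalog: each entry describes a dataset,
# making the warehouse searchable without touching the underlying data.

catalog = [
    {"table": "sales_2024", "owner": "finance",
     "created": "2024-01-02", "columns": ["date", "region", "amount"]},
    {"table": "web_clicks", "owner": "marketing",
     "created": "2024-03-10", "columns": ["ts", "url", "user_id"]},
]

def find_tables(column_name):
    """Return the tables whose metadata lists the given column."""
    return [m["table"] for m in catalog if column_name in m["columns"]]
```

Real metadata management systems add lineage, data types, and descriptions, but the principle is the same: a searchable index over what the warehouse contains.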

SQL is the de facto standard language for querying your data. This is the language that analysts use to pull out insights from their data stored in the data warehouse. Typically data warehouses have proprietary SQL query processing technologies tightly coupled with the compute. This allows for very high performance when it comes to your analytics. One thing to note, however, is that the cost of a data warehouse can start getting expensive the more data and SQL compute resources you have.
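The kind of analytical SQL described here can be demonstrated with Python's built-in sqlite3 module, used purely as a stand-in for a warehouse engine (the table and values are invented):

```python
# Sketch: an analyst-style aggregate query over warehouse-like data,
# using sqlite3 as a lightweight stand-in for a real warehouse engine.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("EU", 100.0), ("EU", 50.0), ("US", 200.0)])

rows = con.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```

A production warehouse runs the same SQL across distributed compute; the analyst-facing language is unchanged.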

The data layer is the access layer that allows users to actually get to the data. This is typically where you’d find a data mart . This layer partitions segments of your data out depending on who you want to give access to, so you can get very granular across your organization. For instance, you may not want to give your sales team access to your HR team’s data, and vice versa.

This is related to the data layer in that you need to be able to provide fine-grained access and security policies across all your organization’s data. Typically data warehouses have very good data governance and security capabilities built in, so you don’t need to do a lot of custom data engineering work to include this. It’s important to plan for governance and security as you add more data to your warehouse and as your company grows.

While access tools are external to your data warehouse, they can be seen as its business-user friendly front end. This is where you’d find your reporting and visualization tools, used by data analysts and business users to interact with the data, extract insights and create visualizations that the rest of the business can consume. Examples of these tools include Tableau, Looker and Qlik.

OLAP (online analytical processing) is software for performing multidimensional analysis at high speeds on large volumes of data from a unified, centralized data store, such as a data warehouse. OLTP (online transactional processing) enables the real-time execution of large numbers of database transactions by large numbers of people, typically over the internet. The main difference between OLAP and OLTP is in the name: OLAP is analytical in nature, and OLTP is transactional.

OLAP tools are designed for multidimensional analysis of data in a data warehouse, which contains both historical and transactional data. Common uses of OLAP include data mining and other business intelligence apps, complex analytical calculations, and predictive scenarios, as well as business reporting functions like financial analysis, budgeting, and forecast planning.
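The multidimensional analysis OLAP performs can be sketched as a rollup over two dimensions (region x quarter; the figures are invented), the slice-and-dice operation OLAP servers accelerate:

```python
# Sketch of an OLAP-style rollup: aggregate one measure (sales)
# across two dimensions, including subtotals and a grand total.
from collections import defaultdict

facts = [
    {"region": "EU", "quarter": "Q1", "sales": 100},
    {"region": "EU", "quarter": "Q2", "sales": 150},
    {"region": "US", "quarter": "Q1", "sales": 200},
]

cube = defaultdict(int)
for f in facts:
    cube[(f["region"], f["quarter"])] += f["sales"]  # individual cell
    cube[(f["region"], "*")] += f["sales"]           # roll up over quarter
    cube[("*", f["quarter"])] += f["sales"]          # roll up over region
    cube[("*", "*")] += f["sales"]                   # grand total
```

An OLAP server precomputes and indexes such aggregates across many dimensions, which is what makes interactive slicing fast.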

OLTP is designed to support transaction-oriented applications by processing recent transactions as quickly and accurately as possible. Common uses of OLTP include ATMs, e-commerce software, credit card payment data processing, online bookings, reservation systems, and record-keeping tools.

For a deep dive into the differences between these approaches, check out " OLAP vs. OLTP: What's the Difference? " 

Schemas are ways in which data is organized within a database or data warehouse. There are two main types of schema structures, the star schema and the snowflake schema, which will impact the design of your data model .

Star schema:  This schema consists of one fact table which can be joined to a number of denormalized dimension tables. It is considered the simplest and most common type of schema, and its users benefit from its faster speeds while querying.

Snowflake schema:  While not as widely adopted, the snowflake schema is another organization structure in data warehouses. In this case, the fact table is connected to a number of normalized dimension tables, and these dimension tables have child tables. Users of a snowflake schema benefit from its low levels of data redundancy, but it comes at a cost to query performance. 
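The star schema described above can be illustrated with sqlite3 (the tables and values are invented): one fact table joined to a denormalized dimension table, queried by a dimension attribute.

```python
# Sketch of a star schema: a fact table of sales joined to a
# denormalized product dimension, using sqlite3 purely for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                          name TEXT, category TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, qty INTEGER, revenue REAL);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'),
                               (2, 'Gadget', 'Hardware');
INSERT INTO fact_sales  VALUES (1, 3, 30.0), (2, 1, 25.0), (1, 2, 20.0);
""")

by_category = con.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category
""").fetchall()
```

A snowflake schema would further split `dim_product` into normalized child tables (e.g. a separate category table), trading an extra join at query time for less redundancy.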

Data warehouse, database, data lake , and data mart are all terms that tend to be used interchangeably. While the terms are similar, important differences exist:

Data warehouse vs. data lake  

Using a data pipeline , a data warehouse gathers raw data from multiple sources into a central repository, structured using predefined schemas designed for data analytics. A data lake is a data warehouse without the predefined schemas. As a result, it enables more types of analytics than a data warehouse. Data lakes are commonly built on big data platforms such as Apache Hadoop.

Data warehouse vs. data mart  

A data mart is a subset of a data warehouse that contains data specific to a particular business line or department. Because they contain a smaller subset of data, data marts enable a department or business line to discover more-focused insights more quickly than possible when working with the broader data warehouse data set.

Data warehouse vs. database  

A database is built primarily for fast queries and transaction processing, not analytics. A database typically serves as the focused data store for a specific application, whereas a data warehouse stores data from any number (or even all) of the applications in your organization.

A database focuses on updating real-time data while a data warehouse has a broader scope, capturing current and historical data for predictive analytics, machine learning, and other advanced types of analysis.

Cloud data warehouse  

A cloud data warehouse is a data warehouse specifically built to run in the cloud, and it is offered to customers as a managed service. Cloud-based data warehouses have grown more popular over the last five to seven years as more companies use cloud computing services and seek to reduce their on-premises  data center  footprint.

With a cloud data warehouse, the physical data warehouse infrastructure is managed by the cloud company, meaning that the customer doesn’t have to make an upfront investment in hardware or software and doesn’t have to manage or maintain the data warehouse solution.

Data warehouse software (on-premises/license)  

A business can purchase a data warehouse license and then deploy a data warehouse on their own on-premises infrastructure. Although this is typically more expensive than a cloud data warehouse service, it might be a better choice for government entities, financial institutions, or other organizations that want more control over their data or need to comply with strict security or data privacy standards or regulations.

Data warehouse appliance  

A data warehouse appliance is a pre-integrated bundle of hardware and software—CPUs, storage, operating system, and data warehouse software—that a business can connect to its  network  and start using as-is. A data warehouse appliance sits somewhere between cloud and on-premises implementations in terms of upfront cost, speed of deployment, ease of scalability, and data management control.

A data warehouse provides a foundation for the following:

  • Better data quality:  A data warehouse centralizes data from a variety of data sources, such as transactional systems, operational databases, and flat files. It then cleanses the operational data, eliminates duplicates, and standardizes it to create a single source of the truth.
  • Faster business insights:  Data from disparate sources limits the ability of decision makers to set business strategies with confidence. Data warehouses enable data integration, allowing business users to leverage all of a company’s data in each business decision. Data warehouse data makes it possible to report on themes, trends, aggregations, and other relationships among data collected from an engineering lifecycle management (ELM) app.
  • Smarter decision-making:  A data warehouse supports large-scale BI functions such as data mining (finding unseen patterns and relationships in data), artificial intelligence, and machine learning—tools data professionals and business leaders can use to get hard evidence for making smarter decisions in virtually every area of the organization, from business processes to financial management and inventory management.
  • Gaining and growing competitive advantage:  All of the above combine to help an organization find more opportunities in data, more quickly than is possible from disparate data stores.

As companies start housing more data and needing more advanced analytics and a wide range of data, the data warehouse starts to become expensive and not so flexible. If you want to analyze unstructured or semi-structured data, the data warehouse won’t work. We’re seeing more companies moving to the data lakehouse architecture, which helps to address the above. The open data lakehouse allows you to run warehouse workloads on all kinds of data in an open and flexible architecture. This data can also be used by data scientists and engineers who study data to gain business insights. Instead of a tightly coupled system, the data lakehouse is much more flexible and also can manage unstructured and semi-structured data like photos, videos, IoT data and more.

The data lakehouse can also support your data science, ML and AI workloads in addition to your reporting and dashboarding workloads. If you are looking to upgrade from data warehouse architecture, then developing an open data lakehouse is the way to go.


Case Study: Cornell University Automates Data Warehouse Infrastructure

Cornell University is a privately endowed research university founded in 1865. Ranked in the top one percent of universities in the world, Cornell is made up of 14 colleges and schools serving roughly 22,000 students. Jeff Christen, data warehousing manager at Cornell University and adjunct faculty in Information Science, and Chris Stewart, VP and general […]


The Primary Issue

Cornell was using Cognos Data Manager to transform and merge data into an Oracle Data Warehouse. IBM purchased Data Manager and decided to end support for the product. “Unfortunately, we had millions of lines of code written in Data Manager, so we had to shop around for a replacement,” said Christen. He looked at it as an opportunity to add new functionality so that their data warehouse ran more efficiently.

The Assessment

Christen’s IT team had to confine processing to hours when the university was closed. Batch processing of financial, PeopleSoft, and student records could not start until normal operations ended, and it had to be completely finished by 8:00 a.m., when arriving staff needed access to the warehouse.

“It was getting really close. We were frequently bumping into that time,” said Christen. Because their processing window was so short, errors and issues could be very disruptive.

“Our old tool would just log it if there was an issue, but then we couldn’t load the warehouse, because some network glitch that probably took seconds was enough to take out our nightly ETL processing,” elaborated Christen.
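A common mitigation for the failure mode Christen describes is to retry transient errors with exponential backoff inside the ETL scheduler, so that a seconds-long network glitch no longer kills the whole nightly load. A minimal sketch; the function names and the choice of `ConnectionError` as the transient error type are illustrative, not Cornell's actual setup:

```python
import time

def run_with_retry(step, max_attempts=3, base_delay=5.0):
    """Run one ETL step, retrying transient failures with exponential backoff.

    `step` is any zero-argument callable (e.g. the extract for one source).
    A brief network glitch then costs seconds, not the whole nightly window.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except ConnectionError:  # treated as transient here
            if attempt == max_attempts:
                raise  # give up and let the scheduler log a real failure
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(delay)
```

With `max_attempts=3`, two glitches in a row still leave the nightly load on schedule.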

Outdated documentation was also a problem. Stewart said that they joke with their customers about documenting a data warehouse. “There are two types of documentation: nonexistent and wrong. People laugh, but nobody ever argues that point because it’s the thing that people don’t like to do, so it rarely gets done,” said Stewart.

Because it is an academic institution, licensing and staffing costs were important factors for Cornell. Stewart often sees this in government and in higher education organizations where the administration has increasing data needs, yet the pool of available people is small, like Christen’s staff of four.

Stewart said that automation can lift much of that workload so staff can get more accomplished in a shorter amount of time. “You can’t just go out and add two more people. If you have more work, you need to get more out of your existing staff,” said Stewart.

Finding a Solution

Christen started to shop around for ETL tools, with an eye to adding some improvements. He focused on several key areas when evaluating vendors: documentation, licensing costs, performance, and the ability to work within existing staffing levels. In 2014, Christen attended the Higher Education Data Warehousing conference to research options.

WhereScape was one of the exhibitors at the conference and one of the features that caught his attention was its approach to documentation. “Our customers were used to having outdated and incomplete documentation, and that was something WhereScape definitely had a handle on,” he said.

Most of the products Cornell considered required licensing by CPU, which could prove cost-prohibitive as Cornell’s extensive data warehouse environment was scaled for end-user query performance.

“We have a ton of CPUs,” Christen said. CPU-based licensing costs would be significant, and they found themselves trying to figure out how to re-architect the entire system to reduce the CPU footprint enough so that the licensing could work, a process that would create other limitations. WhereScape’s license model is a developer seat license, so with four full-time warehouse developers, they only needed to purchase four named user licenses.

“There’s no separate license for the CPU run-time environment with WhereScape, so if we’re successful, we’ll get everything converted, but there’s no penalty for how we configure the warehouse for end-user performance or query performance,” Christen said.

Being able to integrate and use the product without increasing the number of developers was a clear advantage. “That has been a key driver for organizations evaluating automation for their teams,” Stewart added.

Cornell didn’t just rely on marketing material to make their decision. They did an on-site proof of concept where one of their developers worked with the product on a portion of their primary general ledger model. They discovered that WhereScape was intuitive enough that one of their ETL developers was able to code a parallel environment in the proof of concept with minimal assistance from WhereScape. The developer hadn’t gone through any formal training, which proved that the learning curve would be manageable.

The proof of concept allowed them to get a nearly apples-to-apples comparison, which showed “huge improvements” in load time performance compared to Data Manager. “So, it was a robust enough tool, but also intuitive enough that it could be mastered in a few weeks,” said Christen.

About WhereScape

WhereScape helps IT organizations of all sizes leverage automation to design, develop, deploy and operate data infrastructure faster.

“We realized long ago that there were patterns in data warehousing that really transcend any industry vertical or any size of company,” said Stewart.

Because the process of building out a data warehouse is primarily mechanical, and much of that work is common across data warehousing organizations, WhereScape automates both the design and modeling of the data warehouse, all the way through to the physical build.

“Even deployments, as you’re moving a project from development to quality assurance environment (QA), and then on to production, we’re scripting all that out as well,” said Stewart. These are all processes companies usually use multiple tools to address – a resource-heavy process that can create a silo for each tool.

“We have one tool suite that covers data warehousing end-to-end and it’s just one set of tools to learn,” said Stewart. Instead of licensing separate tools for each part of building a data warehouse, finding a place to install all those tools, and spending weeks on staff training and management, teams have just one tool to learn and use. Handing off the build to WhereScape’s automated process frees up time and energy so that the business can take advantage of that data and produce useful analytics.

The initial wins of the conversion from their traditional ETL tool to WhereScape allowed Cornell to cut their nightly refresh times in half, or better, in some cases. Although they didn’t start that way, they are now a 100 percent WhereScape solution, with 100 percent Amazon-hosting as well.

“We did a major conversion which took a few years to get to WhereScape from our old tool, but that’s behind us. We’re running WhereScape on Amazon Web Services in their Oracle RDS service,” said Christen.

Although the conversion was only completed in the last year, all new development and enhancements have been done in WhereScape since its purchase in 2014.

“There’s actually an option to fix the problem, restart it, and still complete before business hours, which is a big win for our customers,” said Christen. “Essentially, we’ve cut our refresh times in half, so not only can the team complete all the processing they need with their batch windows, we’re not brushing up against business hours anymore.”

By automatically generating documentation, WhereScape solved the problem of outdated and incomplete documentation.
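Documentation generated from the warehouse's own metadata stays current because it can be rebuilt on every deployment. A hedged sketch of the idea, assuming a simple in-memory catalog rather than WhereScape's actual mechanism:

```python
def generate_table_docs(catalog):
    """Render Markdown documentation from warehouse metadata.

    `catalog` maps table name -> list of (column, type, description)
    tuples, as might be read from the database's information schema.
    """
    lines = []
    for table, columns in sorted(catalog.items()):
        lines.append(f"## {table}")
        lines.append("| Column | Type | Description |")
        lines.append("| --- | --- | --- |")
        for name, ctype, desc in columns:
            lines.append(f"| {name} | {ctype} | {desc} |")
        lines.append("")  # blank line between tables
    return "\n".join(lines)
```

Because the docs are derived rather than hand-written, they cannot drift into Stewart's "nonexistent and wrong" categories.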

What’s Next?

To take full advantage of the automated documentation process, Cornell decided to build in some new subject areas, but the speed of the tool outstripped their internal modified waterfall approval process. Christen believes they can speed up their process now that they can quickly put out a prototype. They can start receiving feedback immediately from customers within days rather than weeks, and from there, refine the model until they’re ready for production.

“So, it’s changing our practices now that we have some new abilities with WhereScape,” said Christen. One of the next steps is to more fully leverage and market the documentation so they can start providing their customers with more information about the attributes that are available in the warehouse.

An unexpected benefit is that Christen’s Business Intelligence Systems students get to use WhereScape to learn Dimensional Data Modeling, ETL concepts, and Data Visualization hands-on with real datasets.

“We’re teaching the concepts of automation so they learn the hard way, with SQL statements, and then we use WhereScape and they can see how quickly they can create these structures to build out real dimensional model data warehouses,” explained Christen.

Stewart noted that they’ve had inquiries from other universities that have heard about Christen’s use of WhereScape in the classroom and are interested in incorporating WhereScape into their curriculum, so the students can get more work done in a semester.

“It’s a similar benefit to what our customers are receiving in their ‘real-world’ application of automation, and it is giving students the chance to understand the full data warehousing lifecycle,” said Stewart.



Open Access

Good practices for clinical data warehouse implementation: A case study in France


Affiliations Mission Data, Haute Autorité de Santé, Saint-Denis, France, Inria, Soda team, Palaiseau, France


Affiliation Mission Data, Haute Autorité de Santé, Saint-Denis, France

Affiliations Univ. Lille, CHU Lille, ULR 2694—METRICS: Évaluation des Technologies de santé et des Pratiques médicales, Lille, France, Fédération régionale de recherche en psychiatrie et santé mentale (F2RSM Psy), Hauts-de-France, Saint-André-Lez-Lille, France

Affiliation Sorbonne Université, Inserm, Université Sorbonne Paris-Nord, Laboratoire d’informatique médicale et d’ingénierie des connaissances en e-Santé, LIMICS, France

  • Matthieu Doutreligne, 
  • Adeline Degremont, 
  • Pierre-Alain Jachiet, 
  • Antoine Lamer, 
  • Xavier Tannier

PLOS

Published: July 6, 2023

  • https://doi.org/10.1371/journal.pdig.0000298

29 Sep 2023: Doutreligne M, Degremont A, Jachiet PA, Lamer A, Tannier X (2023) Correction: Good practices for clinical data warehouse implementation: A case study in France. PLOS Digital Health 2(9): e0000369. https://doi.org/10.1371/journal.pdig.0000369 View correction


Real-world data (RWD) holds great promise for improving the quality of care. However, specific infrastructures and methodologies are required to derive robust knowledge and bring innovations to the patient. Drawing upon the national case study of the governance of the 32 French regional and university hospitals, we highlight key aspects of modern clinical data warehouses (CDWs): governance, transparency, types of data, data reuse, technical tools, documentation, and data quality control processes. Semi-structured interviews and a review of reported studies on French CDWs were conducted from March to November 2022. Of the 32 regional and university hospitals in France, 14 have a CDW in production, 5 are experimenting, 5 have a prospective CDW project, and 8 did not have any CDW project at the time of writing. The implementation of CDWs in France dates from 2011 and accelerated in late 2020. From this case study, we draw some general guidelines for CDWs. The current orientation of CDWs towards research requires efforts in governance stabilization, standardization of data schemas, and development of data quality and data documentation. Particular attention must be paid to the sustainability of the warehouse teams and to multilevel governance. The transparency of the studies and of the data transformation tools must improve to allow successful multicentric data reuse as well as innovations in routine care.

Author summary

Reusing routine care data does not come free of charge. Attention must be paid to the entire life cycle of the data to create robust knowledge and develop innovation. Building upon the first overview of CDWs in France, we document key aspects of the collection and organization of routine care data into homogeneous databases: governance, transparency, types of data, main objectives of data reuse, technical tools, documentation, and data quality control processes. The landscape of CDWs in France dates from 2011 and accelerated in late 2020, showing progressive but still incomplete homogenization. National and European projects are emerging, supporting local initiatives in standardization, methodological work, and tooling. From this sample of CDWs, we draw general recommendations aimed at consolidating the potential of routine care data to improve healthcare. Particular attention must be paid to the sustainability of the warehouse teams and to multilevel governance. The transparency of the data transformation tools and studies must improve to allow successful multicentric data reuse as well as innovations for the patient.

Citation: Doutreligne M, Degremont A, Jachiet P-A, Lamer A, Tannier X (2023) Good practices for clinical data warehouse implementation: A case study in France. PLOS Digit Health 2(7): e0000298. https://doi.org/10.1371/journal.pdig.0000298

Editor: Dukyong Yoon, Yonsei University College of Medicine, REPUBLIC OF KOREA

Copyright: © 2023 Doutreligne et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: MD, AD, and PAJ salaries were funded by the French Haute Autorité de Santé (HAS). XT received funding to participate in the interviews and in the drafting of the article. AL received no funding for this study. The funders validated the study original idea and the study conclusions. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: the first author did a (non-paid) visiting period in Leo Anthony Celi’s lab during the first semester of 2023.

Introduction

Real-world data.

Health information systems (HIS) are increasingly collecting routine care data [ 1 – 7 ]. This source of real-world data (RWD) [ 8 ] holds great promise for improving the quality of care. On the one hand, the use of these data translates into direct benefits (primary uses) for the patient by serving as the cornerstone of developing personalized medicine [ 9 , 10 ]. They also bring indirect benefits (secondary uses) by accelerating and improving the production of knowledge: on pathologies [ 11 ], on the conditions of use of health products and technologies [ 12 , 13 ], and on measures of their safety [ 14 ], efficacy, or usefulness in everyday practice [ 15 ]. They can also be used to assess the organizational impact of health products and technologies [ 16 , 17 ].

In recent years, health agencies in many countries have conducted extensive work to better support the generation and use of real-world data [ 8 , 17 – 19 ]. Study programs have been launched by regulatory agencies: the DARWIN EU program by the European Medicines Agency and the Real World Evidence Program by the Food and Drug Administration [ 20 ].

Clinical data warehouse

In practice, the possibility of mobilizing these routinely collected data depends very much on their degree of concentration, in a gradient that goes from centralization in a single, homogeneous HIS to fragmentation in a multitude of HISs with heterogeneous formats. The structure of the HIS reflects the governance structure. Thus, the ease of working with these data depends heavily on the organization of the healthcare actors. The 2 main sources of RWD are insurance claims (more centralized) and clinical data (more fragmented).

Claims data is often collected by national agencies into centralized repositories. In South Korea, the government agency responsible for healthcare system performance and quality (HIRA) is connected to the HIS of all healthcare stakeholders. HIRA data consists of national insurance claims [ 21 ]. England has a centralized healthcare system under the National Health Service (NHS). Despite not having detailed clinical data, this allowed the NHS to merge claims data with detailed data from 2 large urban medicine databases, corresponding to the 2 major software publishers [ 22 ]. This data is currently accessed through Opensafely, a first platform focused on Coronavirus Disease 2019 (COVID-19) research [ 23 ]. In the United States, even if scattered between different insurance providers, claims are pooled into large databases such as Medicare, Medicaid, or IBM MarketScan. Lastly, in Germany, the distinct federal claims have been centralized only very recently [ 24 ].

Clinical data, on the other hand, tend to be distributed among many entities that made different choices, without common management or interoperability. But large institutional data-sharing networks are beginning to emerge. South Korea very recently launched an initiative to build a nationwide data network focused on intensive care. The United States is building Chorus4ai, an analysis platform pooling data from 14 university hospitals [ 25 ]. To unlock the potential of clinical data, the German Medical Informatics Initiative [ 26 ] created 4 consortia in 2018. They aim to develop technical and organizational solutions to improve the consistency of clinical data.

Israel stands out as one of the rare countries that pooled together both claims and clinical data at a large scale: half of the population depends on 1 single healthcare provider and insurer [ 27 ].

An infrastructure is needed to pool data from 1 or more medical information systems, whatever the organizational framework, into homogeneous formats, for management, research, or care reuses [ 28 , 29 ]. Fig 1 illustrates the 4 phases of data flow in a CDW, from the various sources that make up the HIS:

  • Collection and copying of original sources.
  • Integration of sources into a unique database.
  • Deduplication of identifiers.
  • Standardization: A unique data model, independent of the software models, harmonizes the different sources in a common schema, possibly with common nomenclatures.
  • Pseudonymization: Removal of directly identifying elements.
  • Provision of subpopulation data sets and transformed datamarts for primary and secondary reuse.
  • Usages thanks to dedicated applications and tools accessing the datamarts and data sets.
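The phases above can be sketched in miniature. Everything here (the field names, the salt, the deduplication key) is illustrative, not a prescription for a real CDW:

```python
import hashlib

def pseudonymize(record, id_field="patient_id", salt="local-secret"):
    """Replace the direct identifier with a salted hash (a pseudonym)."""
    rec = dict(record)
    raw = (salt + str(rec.pop(id_field))).encode()
    rec["pseudo_id"] = hashlib.sha256(raw).hexdigest()[:12]
    return rec

def build_warehouse(sources):
    """Toy CDW flow: integrate sources, deduplicate, standardize, pseudonymize.

    `sources` maps a source name to a list of records whose field names
    differ per source; FIELD_MAP aligns them on a common schema.
    """
    FIELD_MAP = {"pat_id": "patient_id", "id_patient": "patient_id"}
    integrated, seen = [], set()
    for name, records in sources.items():
        for rec in records:
            std = {FIELD_MAP.get(k, k): v for k, v in rec.items()}
            std["source"] = name
            key = (std["patient_id"], std.get("event"))
            if key in seen:  # deduplicate the same event across sources
                continue
            seen.add(key)
            integrated.append(pseudonymize(std))
    return integrated
```

A real pipeline would also keep the mapping from pseudonym back to identity in a separately secured store, which this sketch omits.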

In France, the national insurer collects all hospital activity and city care claims into a unique reimbursement database [ 13 ]. However, clinical data is historically scattered at each care site in numerous HISs. Several hospitals deployed efforts for about 10 years to create CDWs from electronic medical records [ 30 – 39 ]. This work has accelerated recently, with the beginning of CDWs structuring at the regional and national levels. Regional cooperation networks are being set up—such as the Ouest Data Hub [ 40 ]. In July 2022, the Ministry of Health opened a 50 million euros call for projects to set up and strengthen a network of hospital CDWs coordinated with the national platform, the Health Data Hub by 2025.


CDW: Four steps of data flow from the Hospital Information System: (1) collection, (2) transformations, and (3) provisioning. CDW, clinical data warehouse.

https://doi.org/10.1371/journal.pdig.0000298.g001

Based on an overview of university hospital CDWs in France, this study makes general recommendations for properly leveraging the potential of CDWs to improve healthcare. It focuses on: governance, transparency, types of data, data reuse, technical tools, documentation, and data quality control processes.

Material and methods

Interviews were conducted from March to November 2022 with 32 French regional and university hospitals, both with existing and prospective CDWs.

Ethics statement

This work has been authorized by the board of the French High Authority of Health (HAS). Every interviewed participant was asked by email for their participation and informed of the possible forms of publication: a French official report and an international publication. Furthermore, at each interview, every participant was asked for their agreement before the interview was recorded. Only 1 participant declined to have the video recorded.

Semi-structured interviews were conducted on the following themes: the initiation and construction of the CDWs, the current status of the project and the studies carried out, opportunities and obstacles, and quality criteria for observational research. S1 Table lists all interviewed people with their team title. The complete form, with the precise questions, is available in S2 Table.

The interview form was sent to participants in advance and then used as a support to conduct the interviews. The interviews lasted 90 min and were recorded for reference.

Quantitative methods

Three tables detail the structured answers in S1 Text. The first 2 tables deal with the characteristics of the actors and those of the data warehouses. We completed them based on the notes taken during the interviews, the recordings, and by asking the participants for additional information. The third table focuses on ongoing studies in the CDWs. We collected the list of these studies from the dedicated reporting portals, which we found for 8 out of 14 operational CDWs. We developed a classification of studies based on the typology of retrospective studies described by the OHDSI research network [ 41 ]. We enriched this typology by comparing it with the collected studies, resulting in the following 6 categories:

  • Outcome frequency: Incidence or prevalence estimation for a medically well-defined target population.
  • Population characterization: Characterization of a specific set of covariates. Feasibility and prescreening studies belong to this category [ 42 ].
  • Risk factors: Identification of the covariates most associated with a well-defined clinical target (disease course, care event). These studies examine associations without quantifying the causal effect of the factors on the outcome of interest.
  • Treatment effect: Evaluation of the effect of a well-defined intervention on a specific outcome target. These studies intend to show a causal link between these 2 variables [ 43 ].
  • Development of diagnostic and prognostic algorithms: Improve or automate a diagnostic or prognostic process, based on clinical data from a given patient. This can take the form of a risk or preventive score, or the implementation of a diagnostic assistance system. These studies are part of the individualized medicine approach, with the goal of inferring relevant information at the level of individual patients’ files.
  • Medical informatics: Methodological or tool-oriented. These studies aim to improve the understanding and capacity for action of researchers and clinicians. They include the evaluation of a decision support tool, the extraction of information from unstructured data, or automatic phenotyping methods.

Studies were classified according to this nomenclature based on their title and description.
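As an illustration only, a first-pass classifier over titles and descriptions could use keyword matching; the paper's classification was done manually, and the keywords below are assumptions, not the authors' criteria:

```python
# Hypothetical keyword lists; a manual review remains the reference.
CATEGORY_KEYWORDS = {
    "outcome frequency": ["incidence", "prevalence"],
    "population characterization": ["characterization", "feasibility", "prescreening"],
    "risk factors": ["risk factor", "association"],
    "treatment effect": ["effect of", "efficacy", "comparative"],
    "diagnostic/prognostic algorithms": ["prediction", "prognostic", "diagnostic", "score"],
    "medical informatics": ["phenotyping", "nlp", "extraction", "decision support"],
}

def classify_study(title_and_description):
    """Assign a study to the first category whose keyword appears."""
    text = title_and_description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return category
    return "unclassified"
```

Such a heuristic could pre-sort portal listings for the manual pass, not replace it.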

Fig 2 summarizes the development state of progress of CDWs in France. Out of 32 regional and university hospitals in France, 14 have a CDW in production, 5 are experimenting, 5 have a prospective CDW project, 8 did not have any CDW project at the time of writing. The results are described for all projects that are at least in the prospective stage minus the 3 that we were unable to interview after multiple reminders (Orléans, Metz, and Caen), resulting in a denominator of 21 university hospitals.


Base map and data from OpenStreetMap and OpenStreetMap Foundation. Link to the base layer of the map: https://github.com/mapnik/mapnik . CDW, clinical data warehouse.

https://doi.org/10.1371/journal.pdig.0000298.g002

Fig 3 shows the history of the implementation of CDWs. A distinction must be made between the first works (in blue), which systematically precede the regulatory authorization (in green) from the French Commission on Information Technology and Liberties (CNIL).


CDW, clinical data warehouse.

https://doi.org/10.1371/journal.pdig.0000298.g003

The CDWs have so far been initiated by 1 or 2 people from the hospital world with an academic background in bioinformatics, medical informatics, or statistics. The sustainability of the CDW is accompanied by the construction of a cooperative environment between different actors: Medical Information Department (MID), Information Systems Department (IT), Clinical Research Department (CRD), clinical users, and the support of the management or the Institutional Medical Committee. It is also accompanied by the creation of a team, or entity, dedicated to the maintenance and implementation of the CDW. More recent initiatives, such as those of the HCL (Hospitals of the city of Lyon) or the Grand-Est region, are distinguished by an initial, institutional, and high-level support.

The CDW has a federating potential for the different business departments of the hospital with the active participation of the CRD, the IT Department, and the MID. Although there is always an operational CDW team, the human resources allocated to it vary greatly: from half a full-time equivalent to 80 people for the AP-HP, with a median of 6.0 people. The team systematically includes a coordinating physician. It is multidisciplinary with skills in public health, medical informatics, informatics (web service, database, network, infrastructure), data engineering, and statistics.

Historically, the first CDWs were based on in-house solution development. More recently, private actors have been offering their services for the design and implementation of CDWs (15/21). These services range from technical expertise in building the data flows and data cleaning, up to the delivery of a platform integrating the different stages of data processing.

Management of studies

Before starting, projects are systematically analyzed by a scientific and ethical committee. A local submission and follow-up platform is often mentioned (12/21), but its functional scope is not well defined. It ranges from simple authorization of the project to the automatic provision of data into a Trusted Research Environment (TRE) [ 44 ]. The processes for starting a new project on the CDW are always communicated internally but rarely documented publicly (8/21).

Transparency

Ongoing studies in CDWs are unevenly referenced publicly on hospital websites. Some institutions have comprehensive study portals, while others list only a dozen studies on their public site while mentioning several hundred ongoing projects during interviews. In total, we found 8 of these portals out of 14 CDWs in production. Uses other than ongoing scientific studies are very rarely documented. The publication of the list of ongoing studies is very heterogeneous and fragmented between several sources: clinicaltrials.gov, the mandatory project portal of the Health Data Hub [ 45 ], and the website of the hospital data warehouse.

Strong dependence on the HIS.

CDW data reflect the HIS used on a daily basis by hospital staff. Stakeholders point out that the quality of CDW data and the amount of work required for rapid and efficient reuse are highly dependent on the source HIS. The possibility of accessing data from an HIS in a structured and standardized format greatly simplifies its integration into the CDW and then its reuse.

Categories of data.

Although the software landscape is varied across the country, the main functionalities of HIS are the same. We can therefore conduct an analysis of the content of the CDWs, according to the main categories of common data present in the HIS.

The common base for all CDWs is constituted by data from the Patient Administrative Management software (patient identification, hospital movements) and the billing codes. Then, data flows are progressively developed from the various software applications that make up the HIS. The goal is to build a homogeneous data schema, linking the sources together, controlled by the CDW team. The prioritization of sources is done through thematic projects, which feed the CDW construction process. These projects improve the understanding of the sources involved, by confronting the CDW team with the quality issues present in the data.

Table 1 presents the proportions of the data categories integrated in French CDWs. Structured biology results and texts are almost always integrated (20/21 each). The texts contain a large amount of information. They constitute unstructured data and are therefore more difficult to use than structured tables. Other integrated sources are the hospital drug circuit (prescriptions and administration, 16/21), the Intensive Care Unit (ICU, 2/21), and nurse forms (4/21). Imaging is rarely integrated (4/21), notably for reasons of volume. Genomic data are well identified but never integrated, even though they are sometimes considered important and included in the CDW work program.


https://doi.org/10.1371/journal.pdig.0000298.t001

Data reuse.

Today, the main use put forward for the constitution of CDWs is that of scientific research.

The studies are mainly observational (non-interventional). Fig 4 presents the distribution of the 6 categories defined in Quantitative methods for 231 studies collected on the study portals of 9 hospitals. The studies focus first on population characterization (25%), followed by the development of decision support processes (24%), the study of risk factors (18%), and the treatment effect evaluations (16%).


https://doi.org/10.1371/journal.pdig.0000298.g004

The CDWs are used extensively for internal projects such as student theses (at least 9/21) and serve as an infrastructure for single-service research, their great interest being the de-siloing of different information systems. For most of the institutions interviewed, there is still a lack of resources and maturity of methods and tools for conducting inter-institutional research (such as in the Grand-Ouest region of France) or via European calls for projects (EHDEN). These 2 research networks are made possible by supra-local governance and a common data schema, respectively eHop [ 46 ] and OMOP [ 47 ]. The Paris hospital network, thanks to its regional coverage and the choice of OMOP, is also well advanced in multicentric research. At the same time, the Grand-Est region is building a network of CDWs based on the model of the Grand-Ouest region, also using eHop.

CDWs are used for monitoring and management (16/21).

The CDWs have sometimes been initiated to improve and optimize billing coding (4/21). The clinical texts gathered in the same database are queried using keywords to facilitate the structuring of information. The data are then aggregated into indicators, some of which are reported at the national level. The construction of indicators from clinical data can also be used for the administrative management of the institution. Finally, closer to the clinic, some actors state that the CDW could also be used to provide regular and appropriate feedback to healthcare professionals on their practices. This feedback would help to increase the involvement and interest of healthcare professionals in CDW projects. The CDW is sometimes of interest for health monitoring (e.g., during COVID-19) or pharmacovigilance (13/21).

Strong interest for CDW in the context of care (13/21).

Some CDWs develop specific applications that provide new functionalities compared to care software. Search engines can be used to query all the hospital’s data gathered in the CDW, without data compartmentalization between applications. Dedicated interfaces can then offer a unified view of the history of a patient’s data, with inter-specialty transversality, which is particularly valuable in internal medicine. These cross-disciplinary search tools also enable healthcare professionals to conduct rapid searches in all the texts, for example, to find similar patients [ 32 ]. Uses for prevention, automation of repetitive tasks, and care coordination are also highlighted. Concrete examples are the automatic sorting of hospital prescriptions by order of complexity or the setting up of specialized channels for primary or secondary prevention.

Technical architecture

The technical architecture of modern CDWs has several layers:

  • Data processing: connection to and export of source data, and diverse transformations (cleaning, aggregation, filtering, standardization).
  • Data storage: database engines, file storage (on file servers or object storage), indexing engines to optimize certain queries.
  • Data exposure: raw data, APIs, dashboards, development and analysis environments, specific web applications.

Supplementary cross-functional components ensure the efficient and secure operation of the platform: identity and authorization management, activity logging, automated administration of servers and applications.
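As a toy illustration of these three layers, the sketch below wires a processing step, an in-memory storage layer, and a small exposure API together. All names and the data are invented for the example and do not come from any surveyed CDW:

```python
# Minimal sketch of the three CDW layers described above.
# All function names and the toy data are illustrative assumptions.

# --- Data processing: export from a source, then clean and standardize ---
def extract(source_rows):
    """Connection/export step: here, simply yield raw rows."""
    yield from source_rows

def transform(row):
    """Cleaning and standardization: normalize sex codes, coerce types."""
    return {
        "patient_id": row["patient_id"],
        "sex": {"M": "male", "F": "female"}.get(row.get("sex", "").strip().upper()),
        "stay_days": max(0, int(row.get("stay_days", 0))),
    }

# --- Data storage: an in-memory "table" standing in for a database engine ---
warehouse = []

def load(rows):
    warehouse.extend(rows)

# --- Data exposure: a tiny aggregate API on top of the stored data ---
def mean_stay_by_sex():
    totals = {}
    for r in warehouse:
        s = totals.setdefault(r["sex"], [0, 0])
        s[0] += r["stay_days"]
        s[1] += 1
    return {sex: total / n for sex, (total, n) in totals.items()}

source = [
    {"patient_id": 1, "sex": " m ", "stay_days": "3"},
    {"patient_id": 2, "sex": "F", "stay_days": "5"},
    {"patient_id": 3, "sex": "f", "stay_days": "1"},
]
load(transform(r) for r in extract(source))
print(mean_stay_by_sex())  # {'male': 3.0, 'female': 3.0}
```

A real platform would replace each layer with dedicated infrastructure (ETL jobs, a database or object store, dashboards and APIs), but the separation of concerns is the same.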

The analysis environment (JupyterHub or RStudio datalabs) is a key component of the platform, as it allows data to be processed within the CDW infrastructure. A few CDWs had such an operational datalab at the time of our study (6/21), and almost all of them have decided to provide one to researchers. Currently, clinical research teams still often work on data extractions in less secure environments.

Data quality, standard formats

Quality tools.

Systematic data-quality monitoring processes are being built in some CDWs. Often (8/21), scripts are run at regular intervals to detect technical anomalies in data flows. A few data-quality investigation tools, in the form of dashboards, are beginning to be developed in-house (3/21). Theoretical reflections are underway on the possibility of automating data consistency checks, for example demographic or temporal ones. Some facilities randomly pull records from the EHR to compare them with the information in the CDW.
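An automated consistency check of the kind described here could look like the following sketch; the plausibility rules and record fields are assumptions for illustration, not those of any surveyed CDW:

```python
# Illustrative demographic and temporal plausibility checks on CDW records.
# The rules, thresholds, and field names are assumptions for the example.
from datetime import date

def check_record(rec, today=date(2023, 1, 1)):
    """Return a list of anomaly labels for one record (empty if clean)."""
    anomalies = []
    if not (0 <= rec.get("age", -1) <= 120):
        anomalies.append("implausible_age")
    if rec.get("discharge") and rec.get("admission") and rec["discharge"] < rec["admission"]:
        anomalies.append("discharge_before_admission")
    if rec.get("admission") and rec["admission"] > today:
        anomalies.append("admission_in_future")
    return anomalies

records = [
    {"age": 42, "admission": date(2022, 3, 1), "discharge": date(2022, 3, 5)},
    {"age": 130, "admission": date(2022, 6, 2), "discharge": date(2022, 6, 1)},
]
# A periodic script could run checks like these over each night's data flow
# and surface the report in a quality dashboard.
report = {i: check_record(r) for i, r in enumerate(records) if check_record(r)}
print(report)  # {1: ['implausible_age', 'discharge_before_admission']}
```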

Standard format.

No single standard data model stands out as being used by all CDWs. All are aware of the OMOP (research standard) [ 47 ] and HL7 FHIR (communication standard) [ 48 ] models. Several CDWs consider the OMOP model a central part of the warehouse, particularly for research purposes (9/21). This tendency has been encouraged by the European call for projects EHDEN, launched by the OHDSI research consortium, the originator of this data model. In the Grand-Ouest region of France, the CDWs use the eHop warehouse software, which uses a common data model also named eHop. This model will spread further, as the future warehouse network of the Grand-Est region has also chosen this solution. Counting this grouping and the other establishments that have chosen eHop, the model covers 12 of the 32 university hospitals, which allows eHop adopters to launch ambitious interregional projects. However, eHop does not define a standard nomenclature to be used in its model and is not aligned with emerging international standards.
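Terminology mapping between a local nomenclature and a standard vocabulary, one of the alignment tasks such models require, can start as a simple lookup table. The local codes below are invented, and the standard codes follow the LOINC pattern for illustration only:

```python
# Hedged illustration of terminology mapping between a local lab
# nomenclature and an international standard. The local codes are
# invented; the target codes follow the LOINC pattern for illustration.
LOCAL_TO_STANDARD = {
    "GLU_SER": ("2345-7", "Glucose [Mass/volume] in Serum or Plasma"),
    "HB_BLD": ("718-7", "Hemoglobin [Mass/volume] in Blood"),
}

def to_standard(local_code):
    """Map a local lab code to a (standard_code, label) pair, or flag it."""
    try:
        return LOCAL_TO_STANDARD[local_code]
    except KeyError:
        # Unmapped codes should be surfaced so the mapping can be maintained.
        return (None, f"unmapped local code: {local_code}")
```

Maintaining such mappings as local practices and standards both evolve is exactly the ongoing effort the text argues needs stronger incentives.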

Documentation.

Half of the CDWs have put in place documentation, accessible within the organization, on data flows and on the meaning and proper use of qualified data (mentioned by 10/21). This documentation is used by the team that develops and maintains the warehouse, and also by users to understand the transformations performed on the data. However, it is never publicly available, and no schema of the data as transformed and prepared for analysis is published.

Principal findings

We give the first overview of the CDWs in the university hospitals of France, with 32 hospitals reviewed. CDW implementation dates from 2011 and accelerated in late 2020. Today, 24 of the university hospitals have an ongoing CDW project. From this case study, some general considerations can be drawn that should be valuable to any healthcare system implementing CDWs on a national scale.

As the CDW becomes an essential component of data management in the hospital, the creation of an autonomous internal team dedicated to data architecture, process automation, and data documentation should be encouraged [ 44 ]. This multidisciplinary team should develop an excellent knowledge of the data collection process and of potential reuses in order to qualify the different flows coming from the source information systems, standardize them towards a homogeneous schema, and harmonize the semantics. It should have a sound knowledge of public health, as well as the technical and statistical skills to develop high-quality software that facilitates data reuse.

Resources dedicated specifically to the warehouse are scarce and often taken from other budgets or from project-based credits. While this is natural for an initial prototyping phase, it is not suited to the perennial and transversal nature of the tool. As a research infrastructure of growing importance, the CDW must have the financial and organizational means to plan for the long term.

The governance of the CDW has multiple layers: local within the university hospital, interregional, and national/international. The first level ensures the quality of data integration as well as the relevance of data reuse by clinicians themselves. The interregional level is well suited to pooling resources and collaboration. Finally, the national and international levels ensure coordination, encourage consensus on binding choices such as metadata or interoperability, and provide financial, technical, and regulatory support.

Health technology assessment agencies advocate for the public registration of comparative observational study protocols before the analysis is conducted [ 8 , 17 , 49 ]. They often refer to clinicaltrials.gov as a potential, though not ideal, registration portal for observational studies. The research community advocates for the public registration of all observational studies [ 50 , 51 ] and, more recently, emphasizes the need for easier data access and the publication of study code [ 29 , 52 , 53 ]. We embrace these recommendations and point to the unfortunate duplication of these study reporting systems in France. One source could be favored at the national level and the second fed automatically from the reference source, by agreeing on common metadata.

From a patient’s perspective, there is currently no way to know whether one’s personal data is included in a specific project. Better patient information about the reuse of their data is needed to build trust over the long term. A strict minimum is the establishment and regular updating of declarative portals of ongoing studies at each institution.

Data and data usage

When using a CDW, the analyst did not define the data collection process and is generally unaware of the context in which the information was logged. This new dimension of medical research requires much greater development of data science skills, shifting the focus from the implementation of the statistical design to the data engineering process. Data reuse requires more effort to prepare the data and document the transformations performed.

The more heterogeneous a hospital information system (HIS) is, the lower the quality of the CDW built on top of it. There is a need for increased interoperability to help EHR vendors interface the different hospital software systems, thus facilitating CDW development. One step in this direction would be the open-source publication of HIS data schemas and vocabularies. At the analysis level, international recommendations insist on the need for common data formats [ 52 , 54 ]. However, hospital CDWs still lack the adoption of research standards needed to conduct robust studies across multiple sites. Building open-source tools on top of these standards, such as those of OHDSI [ 41 ], could foster their adoption. Finally, in many clinical domains, a sufficient sample size is hard to obtain without international data-sharing collaborations, so stronger incentives are needed to maintain and update the terminology mappings between local nomenclatures and international standards.

Many ongoing studies concern the development of decision-support processes whose goal is to save time for healthcare professionals. These are often research projects not yet integrated into routine care. The analysis of study portals and the interviews revealed that data reuse oriented towards primary care is still rare and rarely supported by appropriate funding. The translation from research to clinical practice takes time and needs to be supported over the long run to yield substantial results.

Tools, methods, and data formats of CDWs lack harmonization, owing to rapid technical innovation and the presence of many actors. As suggested by the recent report on the use of data for research in the UK [ 44 ], it would be wise to focus on a small number of model technical platforms.

These platforms should favor open-source solutions to assure transparency by default, foster collaboration and consensus, and avoid technological lock-in of the hospitals.

Data quality and documentation

Quality is not sufficiently considered a scientific topic in its own right, yet it is the backbone of all research done within a CDW. To improve the quality of the data with respect to research uses, it is necessary to conduct continuous studies dedicated to this topic [ 52 , 54 – 56 ]. These studies should contribute to a reflection on methodologies and standard tools for data quality, such as those developed by the OHDSI research network [ 41 ].

Finally, there is a need for open-source publication of research code to ensure high-quality retrospective research [ 55 , 57 ]. Recent research in data analysis has shown that innumerable biases can lurk in training data sets [ 58 , 59 ]. Open publication of data schemas is considered an indispensable prerequisite for all data science and artificial intelligence uses [ 58 ]. Inspired by data set cards [ 58 ] and data set publication guides, it would be interesting to define a standard CDW card documenting the main data flows.
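One possible shape for such a CDW card, by analogy with dataset cards, is sketched below; every field name here is a suggestion rather than an established standard:

```python
# A hypothetical "CDW card" structure, inspired by dataset cards.
# All field names and the example values are suggestions, not a standard.
from dataclasses import dataclass, field

@dataclass
class CDWCard:
    institution: str
    source_systems: list          # e.g., EHR, billing, laboratory systems
    data_model: str               # e.g., "OMOP", "eHop", or a local schema
    update_frequency: str         # e.g., "daily batch"
    known_biases: list = field(default_factory=list)

card = CDWCard(
    institution="Example University Hospital",
    source_systems=["EHR", "billing", "laboratory"],
    data_model="OMOP",
    update_frequency="daily batch",
    known_biases=["inpatient data only", "coding practices changed in 2019"],
)
```

Publishing such a card alongside each warehouse would make the main data flows, and their known limitations, visible to all data reusers.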

Limitations

The interviews were conducted in a semi-structured manner within a limited time frame. As a result, some topics were covered more quickly, and only those explicitly mentioned by the participants could be recorded. The uneven existence of study portals introduces a bias in the recording of the types of studies conducted on CDWs: institutions that already have a transparency portal tend to have more mature use cases.

For clarity, our results are focused on the perimeter of university hospitals. We have not covered the exhaustive healthcare landscape in France. CDW initiatives also exist in primary care, in smaller hospital groups and in private companies.

Conclusions

The French CDW ecosystem is beginning to take shape, benefiting from an acceleration thanks to national funding, the multiplication of industrial players specializing in health data and the beginning of a supra-national reflection on the European Health Data Space [ 60 ]. However, some points require special attention to ensure that the potential of the CDW translates into patient benefits.

The priority is the creation and perpetuation of multidisciplinary warehouse teams capable of operating the CDW and supporting the various projects. A combination of public health, data engineering, data stewardship, statistics, and IT competencies is a prerequisite for the success of the CDW. The team should be the privileged point of contact for data exploitation issues and should collaborate closely with the existing hospital departments.

The constitution of a multilevel collaboration network is another priority. The local level is essential to structure the data and understand its possible uses. Interregional, national, and international coordination would make it possible to create thematic working groups in order to stimulate a dynamic of cooperation and mutualization.

A common data model should be encouraged, with precise metadata that maps the integrated data, in order to qualify the uses that can be developed today from the CDWs. More broadly, open-source documentation of data flows and of the transformations performed for quality enhancement needs stronger incentives to unleash the potential for innovation for all health data reusers.

Finally, the question of expanding the scope of the data beyond the purely hospital domain must be asked. Many risk factors and patient follow-up data are missing from the CDWs but are crucial for understanding pathologies. Combining ambulatory (community) care data and hospital data would provide a complete view of patient care.

Supporting information

S1 Table. List of interviewed stakeholders with their teams.

https://doi.org/10.1371/journal.pdig.0000298.s001

S2 Table. Interview form.

https://doi.org/10.1371/journal.pdig.0000298.s002

S1 Text. Study data tables.

https://doi.org/10.1371/journal.pdig.0000298.s003

Acknowledgments

We want to thank all the participants and experts interviewed for this study. We also want to thank the other people who proofread the manuscript for external review: Judith Fernandez (HAS), Pierre Liot (HAS), Bastien Guerry (Etalab), Aude-Marie Lalanne Berdouticq (Institut Santé numérique en Société), Albane Miron de L’Espinay (ministère de la Santé et de la Prévention), and Caroline Aguado (ministère de la Santé et de la Prévention). We also thank Gaël Varoquaux for his support and advice.


Data Warehousing

A Database Management System (DBMS) stores data in tables, is typically designed using an ER model, and aims to guarantee the ACID properties for transactions. For example, a college DBMS has tables for students, faculty, etc.

A Data Warehouse is separate from a DBMS. It stores a huge amount of data, typically collected from multiple heterogeneous sources like files, DBMSs, etc. The goal is to produce statistical results that help in decision-making. For example, a college might want quick answers to various questions, such as how the placement of CS students has improved over the last 10 years in terms of salaries, counts, etc.

Issues That Arise While Building the Warehouse

  • When and how to gather data: In a source-driven architecture for gathering data, the data sources transmit new information, either continually (as transaction processing takes place) or periodically (nightly, for example). In a destination-driven architecture, the data warehouse periodically sends requests for new data to the sources. Unless updates at the sources are replicated at the warehouse via two-phase commit, the warehouse will never be quite up-to-date with the sources. Two-phase commit is usually far too expensive to be an option, so data warehouses typically have slightly out-of-date data. That, however, is usually not a problem for decision-support systems.
  • What schema to use: Data sources that have been constructed independently are likely to have different schemas. In fact, they may even use different data models. Part of the task of a warehouse is to perform schema integration, and to convert data to the integrated schema before they are stored. As a result, the data stored in the warehouse are not just a copy of the data at the sources. Instead, they can be thought of as a materialized view of the data at the sources.
  • Data transformation and cleansing: The task of correcting and preprocessing data is called data cleansing. Data sources often deliver data with numerous minor inconsistencies, which can be corrected. For example, names are often misspelled, and addresses may have street, area, or city names misspelled, or postal codes entered incorrectly. These can be corrected to a reasonable extent by consulting a database of street names and postal codes in each city. The approximate matching of data required for this task is referred to as fuzzy lookup.
  • How to propagate updates: Updates on relations at the data sources must be propagated to the data warehouse. If the relations at the data warehouse are exactly the same as those at the data source, the propagation is straightforward. If they are not, the problem of propagating updates is basically the view-maintenance problem.
  • What data to summarize: The raw data generated by a transaction-processing system may be too large to store online. However, we can answer many queries by maintaining just summary data obtained by aggregation on a relation, rather than maintaining the entire relation. For example, instead of storing data about every sale of clothing, we can store total sales of clothing by item name and category.
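Two of the steps above, fuzzy-lookup cleansing and summarization by aggregation, can be sketched with the standard library; the reference city list and the sales rows are invented for the example:

```python
# Sketch of fuzzy-lookup cleansing (via the standard library's difflib)
# followed by summarization by aggregation. Reference data is invented.
import difflib
from collections import defaultdict

KNOWN_CITIES = ["Springfield", "Shelbyville", "Capital City"]

def cleanse_city(raw):
    """Correct a misspelled city name by approximate (fuzzy) matching."""
    match = difflib.get_close_matches(raw, KNOWN_CITIES, n=1, cutoff=0.6)
    return match[0] if match else raw

sales = [
    {"city": "Springfeild", "category": "clothing", "amount": 40},
    {"city": "Springfield", "category": "clothing", "amount": 60},
    {"city": "Shelbyvile", "category": "footwear", "amount": 25},
]

# Summary data: total sales per (city, category) instead of every raw sale.
summary = defaultdict(int)
for sale in sales:
    summary[(cleanse_city(sale["city"]), sale["category"])] += sale["amount"]

print(dict(summary))
# {('Springfield', 'clothing'): 100, ('Shelbyville', 'footwear'): 25}
```

Production warehouses use dedicated cleansing tools and databases of reference values (street names, postal codes), but the principle of matching dirty input against a trusted reference before aggregating is the same.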

Need for Data Warehouse 

An ordinary database can store MBs to GBs of data, typically for a specific purpose. For data at TB scale, storage shifts to the data warehouse. Besides this, a transactional database doesn't lend itself to analytics. To perform analytics effectively, an organization keeps a central data warehouse to closely study its business by organizing, understanding, and using its historical data for making strategic decisions and analyzing trends.

Benefits of Data Warehouse

  • Better business analytics: A data warehouse plays an important role in storing and analyzing all of a company's past data and records, which deepens the company's understanding and analysis of its data.
  • Faster queries: A data warehouse is designed to handle large queries, so it runs them faster than an operational database.
  • Improved data quality: The warehouse stores and analyzes data gathered from different sources without interfering with it or adding data by itself, so data quality is maintained; any data-quality issue that does arise is handled by the data warehouse team.
  • Historical insight: The warehouse stores all your historical data, which contains details about the business, so it can be analyzed at any time to extract insights.

Data Warehouse vs DBMS  

  • Processing model: A common database is based on operational or transactional processing, where each operation is an indivisible transaction; a data warehouse is based on analytical processing.
  • Data currency: A database generally stores current, up-to-date data used for daily operations; a data warehouse maintains historical data over time, kept over years, which can be used for trend analysis, future predictions, and decision support.
  • Scope: A database is generally application-specific and stores related data, such as the student details in a school; a data warehouse is integrated at the organization level, combining data from one or more databases so that analysis can yield results such as the best-performing school in a city.
  • Cost: Constructing a database is not very expensive; constructing a data warehouse can be expensive.

Example Applications of Data Warehousing  

Data warehousing can be applied anywhere we have a huge amount of data and want statistical results that help in decision-making.

  • Social media websites: Social networking sites like Facebook, Twitter, LinkedIn, etc. are based on analyzing large data sets. They gather data on members, groups, locations, and so on, and store it in a single central repository. Given the volume of data, a data warehouse is needed to implement this.
  • Banking: Most banks use warehouses to study the spending patterns of account and card holders and use the results to target them with special offers, deals, etc.
  • Government: Governments use data warehouses to store and analyze tax payments, which helps detect tax evasion.

Features of Data Warehousing

Data warehousing is essential for modern data management, providing a strong foundation for organizations to consolidate and analyze data strategically. Its distinguishing features empower businesses with the tools to make informed decisions and extract valuable insights from their data.

  • Centralized Data Repository: Data warehousing provides a centralized repository for all enterprise data from various sources, such as transactional databases, operational systems, and external sources. This enables organizations to have a comprehensive view of their data, which can help in making informed business decisions.
  • Data Integration: Data warehousing integrates data from different sources into a single, unified view, which can help in eliminating data silos and reducing data inconsistencies.
  • Historical Data Storage: Data warehousing stores historical data, which enables organizations to analyze data trends over time. This can help in identifying patterns and anomalies in the data, which can be used to improve business performance.
  • Query and Analysis: Data warehousing provides powerful query and analysis capabilities that enable users to explore and analyze data in different ways. This can help in identifying patterns and trends, and can also help in making informed business decisions.
  • Data Transformation: Data warehousing includes a process of data transformation, which involves cleaning, filtering, and formatting data from various sources to make it consistent and usable. This can help in improving data quality and reducing data inconsistencies.
  • Data Mining: Data warehousing provides data mining capabilities, which enable organizations to discover hidden patterns and relationships in their data. This can help in identifying new opportunities, predicting future trends, and mitigating risks.
  • Data Security: Data warehousing provides robust data security features, such as access controls, data encryption, and data backups, which ensure that the data is secure and protected from unauthorized access.

Advantages of Data Warehousing

  • Intelligent Decision-Making: With centralized data in warehouses, decisions may be made more quickly and intelligently.
  • Business Intelligence: Provides strong operational insights through business intelligence.
  • Historical Analysis: Predictions and trend analysis are made easier by storing past data.
  • Data Quality: Guarantees data quality and consistency for trustworthy reporting.
  • Scalability: Capable of managing massive data volumes and expanding to meet changing requirements.
  • Effective Queries: Fast and effective data retrieval is made possible by an optimized structure.
  • Cost reductions: Although there are initial setup costs, data warehousing leads to long-term savings by streamlining data management procedures and increasing overall efficiency.
  • Data security: Data warehouses employ security protocols to safeguard confidential information, guaranteeing that only authorized personnel are granted access to certain data.

Disadvantages of Data Warehousing

  • Cost: Building a data warehouse can be expensive, requiring significant investments in hardware, software, and personnel.
  • Complexity: Data warehousing can be complex, and businesses may need to hire specialized personnel to manage the system.
  • Time-consuming: Building a data warehouse can take a significant amount of time, requiring businesses to be patient and committed to the process.
  • Data integration challenges: Data from different sources can be challenging to integrate, requiring significant effort to ensure consistency and accuracy.
  • Data security: Data warehousing can pose data security risks, and businesses must take measures to protect sensitive data from unauthorized access or breaches.

There can be many more applications in different sectors like E-Commerce, telecommunications, Transportation Services, Marketing and Distribution, Healthcare, and Retail. 

Data warehousing within database management systems (DBMS) enables integrated data management and provides scalable solutions for enhanced business intelligence and decision-making. Its strengths in data quality, historical analysis, and scalability give it a critical role in deriving important insights for a competitive edge, even in the face of implementation challenges.



Case study: a data warehouse for an academic medical center.

  • Department of Surgery
  • Division of Outcomes, Research and Quality
  • Department of Medicine
  • Department of Public Health Sciences
  • Cancer Institute, Cancer Control
  • Penn State Cancer Institute

Research output : Contribution to journal › Article › peer-review

The clinical data repository (CDR) is a frequently updated relational data warehouse that provides users with direct access to detailed, flexible, and rapid retrospective views of clinical, administrative, and financial patient data for the University of Virginia Health System. This article presents a case study of the CDR, detailing its five-year history and focusing on the unique role of data warehousing in an academic medical center. Specifically, the CDR must support multiple missions, including research and education, in addition to administration and management. Users include not only analysts and administrators but clinicians, researchers, and students.

Original language: English (US)
Pages: 165-175
Journal: Journal of Healthcare Information Management (JHIM)
Volume: 15
Issue: 2
State: Published, 2001


Einbinder JS, Scully KW, Pates RD, Schubart JR, Reynolds RE. Case study: a data warehouse for an academic medical center. Journal of Healthcare Information Management (JHIM). 2001;15(2):165-175. ISSN 1099-811X. PMID: 11452578. Scopus: 0035376039.

IMAGES

  1. Data Warehouse Case Study

    case study of data warehouse

  2. (PDF) A CASE STUDY ON DATA MINING AND DATA WAREHOUSE

    case study of data warehouse

  3. Case

    case study of data warehouse

  4. Case Study

    case study of data warehouse

  5. Data Warehouse Case Study Free Essay Example

    case study of data warehouse

  6. Case study: Enterprise Data Warehouse and Master Data Management

    case study of data warehouse

VIDEO

  1. Case Study: Data Driven business with SAP Datasphere

  2. Data Science Interview

  3. Difference between Data Analytics and Data Science . #shorts #short

  4. (Mastering JMP) Visualizing and Exploring Data

  5. Implement Data Warehouse with SQL Server 2012

  6. CASESTUDY

COMMENTS

  1. Successful Data Warehousing in Real Life

    The warehouse contains large amounts of historical data and allows you to study past trends and issues to predict events and improve the business structure. ... a successful data warehouse implementation will vary depending on the goals of the organization and the particular use case for data warehousing. Some common characteristics are ...

  2. Real-Time Data Warehouse Examples (Real World Applications)

    Real-Time Data Warehouse: 3 Real-Life Examples For Enhanced Business Analytics. To truly highlight the importance of real-time data warehouses, let's discuss some real-life case studies. Case Study 1: Beyerdynamic Beyerdynamic, an audio product manufacturer from Germany, was facing difficulties with its previous method of analyzing sales data ...

  3. Real World Data Warehousing Examples: Use Cases and Applications

    On that note, data warehouses are used for business analysis, data and market analytics, and business reporting. Data warehouses typically store historical data by integrating copies of transaction data from disparate sources. Data warehouses can also use real-time data feeds for reports that use the most current, integrated information.

  4. 10 Use Cases for Data Warehouses

    A data warehouse is a data management system used primarily for business intelligence (BI) and analytics. Data warehouses store large amounts of historical data from a wide range of sources and make it available for queries and analysis. These systems are capable of storing large amounts of unstructured data, unlike traditional relational ...

  5. Data Warehouse

    Based on my prior experience as Data Engineer and Analyst, I will explain Data Warehousing and Dimensional modeling using an e-Wallet case study. — Manoj. Data Warehouse. A data warehouse is a large collection of business-related historical data that would be used to make business decisions.

  6. Introduction to data warehouses: use cases, design and more

    Step 4: Combine batch & streaming. Modern data warehouses can integrate both batch and real-time streaming data sources. Batch data pulled from systems of record on periodic extracts provides in-depth historic context. Integrating streaming data from message buses, IOT devices or transactional logs adds a real-time dimension enabling up-to-the ...

  7. 10 Benefits and Use Cases of a Data Warehouse

    Using a data warehouse, business users can generate reports and queries on their own. Users can access all the organization's data from one interface instead of having to log into multiple systems. Easier access to data means less time spent on data retrieval and more time on data analysis. 4. Auditability.

  8. A Data Warehouse Implementation on AWS

    Data Lake. The first part of this case study is the Data Lake. A Data Lake is a repository where data from multiple sources is stored. It allows for working with structured and unstructured data. In this case study, the Data Lake is used as a staging area allowing for centralizing all different data sources.

  9. What is a Data Warehouse?

    A data warehouse, or enterprise data warehouse (EDW), is a system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI) and machine learning. A data warehouse system enables an organization to run powerful analytics on large amounts of data ...

  10. Case Study: Cornell University Automates Data Warehouse Infrastructure

    Cornell was using Cognos Data Manager to transform and merge data into an Oracle Data Warehouse. IBM purchased Data Manager and decided to end support for the product. "Unfortunately, we had millions of lines of code written in Data Manager, so we had to shop around for a replacement," said Christen.

  11. A case study for data warehousing courseware

    A data warehouse provides an effective way to analyze mass data and helps in the decision-making process. The objective of this project is to develop a web-based interactive courseware that helps data warehouse designers enhance their understanding of the key concepts of OLAP using a case study approach.

  12. PDF Data Warehouse Portfolio

    Case study (www.d-Wise.com): a Tier 1 pharmaceutical manufacturer. Business problem: the client's pre-clinical toxicology operations used multiple vendor data-collection source systems, with no way to carry out common reporting. Key issue: how to move to a centralized data warehouse to streamline reporting and analysis.
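The harmonization step such a centralized warehouse implies can be sketched as mapping each vendor feed into one common schema. All field names below are invented for illustration, not taken from the case study.

```python
# Two hypothetical vendor feeds with incompatible field names and units
# already aligned; only the naming differs.
vendor_a = [{"subj": "S1", "dose_mg": 5}]
vendor_b = [{"subject_id": "S2", "dose": 10}]

def normalize_a(rec):
    """Map vendor A's record into the common reporting schema."""
    return {"subject": rec["subj"], "dose_mg": rec["dose_mg"]}

def normalize_b(rec):
    """Map vendor B's record into the common reporting schema."""
    return {"subject": rec["subject_id"], "dose_mg": rec["dose"]}

# One centralized table supports common reporting across both sources.
central = [normalize_a(r) for r in vendor_a] + [normalize_b(r) for r in vendor_b]
print(central)
```

The hard part in practice is agreeing on the common schema, not the mapping code; once that schema exists, every new vendor system only needs its own `normalize_*` adapter.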

  13. PDF A Case Study of Success Factors for Data Warehouse ...

    We follow a case research strategy (Benbasat et al. 1987) to study the case of data warehouse implementation for sales planning at the W&H Group. One of the authors participated directly in the data warehousing project at the W&H Group from January to May 2015, aiding the implementation of the data warehouse.

  14. Data Silos: A Business Case for a New Data Warehouse

    A. Elimination of Data Silos: building a new data warehouse removes data silos, and this centralized approach allows for a unified view of the organization's data.

  15. Data Warehousing Case Study

    Our team can transform data warehousing from a contentious challenge into a value-delivering asset. The d-Wise approach involves data-driven requirements gathering that comprehensively addresses the needs of business users, IT, and enterprise architecture, and technology-agnostic solutions that ensure the right fit for each client's unique needs.

  16. Enterprise Data Warehouse Case Study

    WCI architected a roadmap that would take ERP data from 8 main databases and put it into the Enterprise Data Warehouse. This entailed integrating the 5 Oracle ERP instances with the 3 SAP ERPs. Rapid Marts were also implemented in the Oracle ERP systems to improve the flow of the project. Because this was such a large undertaking, a dedicated project team was also created.

  17. Demystifying Data Warehousing

    Data Warehouse Database: the core repository where data from various sources is stored in a structured and organized form. Includes case studies showcasing successful data warehousing implementations.

  18. Good practices for clinical data warehouse implementation: A case study

    Author summary: reusing routine care data does not come free of charge. Attention must be paid to the entire life cycle of the data in order to create robust knowledge and develop innovation. Building on the first overview of clinical data warehouses (CDWs) in France, the authors document key aspects of collecting and organizing routine care data into homogeneous databases: governance, transparency, types of data, and data reuse.

  19. Case Study: Designing a dimensional model for a cargo shipper

    We'll draw on a case study of a transoceanic shipping company to explore the process of designing a data warehouse schema. A schema or a dimensional model is a logical description of the entire data warehouse. We'll consider a star schema, which is perhaps the most straightforward data warehouse schema.

  20. Data Warehousing

    A data warehouse is separate from the operational DBMS. It stores a huge amount of data, typically collected from multiple heterogeneous sources such as files and databases. The goal is to produce statistical results that help in decision-making. For example, a college might want quick answers to questions such as how the placement of CS students has changed over the years.
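The college placement example reduces to a simple aggregation over collected rows. This sketch invents the row layout (`year`, `dept`, `placed`) purely to show the shape of such a statistical query.

```python
from collections import Counter

# Hypothetical rows gathered from heterogeneous sources (files, DBMS)
# into one store: one record per CS student per year.
placements = [
    {"year": 2022, "dept": "CS", "placed": True},
    {"year": 2022, "dept": "CS", "placed": False},
    {"year": 2023, "dept": "CS", "placed": True},
    {"year": 2023, "dept": "CS", "placed": True},
]

# Count placed students per year: the kind of quick statistical
# result the warehouse exists to answer.
placed_per_year = Counter(r["year"] for r in placements if r["placed"])
print(dict(placed_per_year))  # {2022: 1, 2023: 2}
```

In a real warehouse the same question would be a `GROUP BY year` query, but the point stands: decision-makers ask for aggregates, not raw operational records.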

  21. Case study: a data warehouse for an academic medical center

    This article presents a case study of the CDR, detailing its five-year history and focusing on the unique role of data warehousing in an academic medical center. Specifically, the CDR must support multiple missions, including research and education, in addition to administration and management. Users include not only analysts and administrators ...

  22. (PDF) Data Warehouse A case study Data for Data Warehouse as a Real

    The case study provides foundational knowledge of the data warehouse as a real-time system. It reveals that the data warehouse maximizes business profitability and supports managers in making decisions.

  23. Data Warehousing Failures: Case Studies and Findings

    Eight studies of data warehousing failures are presented, written based on interviews with people who were associated with the projects. The extent of the failure varies with the organization, but in all cases the project was at least a disappointment.