Graphical Representation
Graphical representation is a way of analysing numerical data. It exhibits the relation between data, ideas, information and concepts in a diagram. It is easy to understand and is one of the most important learning strategies. The choice of representation always depends on the type of information in a particular domain. Some of the different types of graphical representation are as follows:

  • Line Graphs – A line graph or linear graph is used to display continuous data, and it is useful for predicting future events over time.
  • Bar Graphs – A bar graph is used to display categorical data, comparing the data using solid bars to represent the quantities.
  • Histograms – A graph that uses bars to represent the frequency of numerical data organised into intervals. Since all the intervals are equal and continuous, all the bars have the same width.
  • Line Plot – It shows the frequency of data along a given number line. An ‘x’ is placed above the number line each time that data value occurs.
  • Frequency Table – The table shows the number of pieces of data that fall within the given interval.
  • Circle Graph – Also known as a pie chart, it shows the relationship of the parts to the whole. The whole circle represents 100%, and each category occupies a slice with its specific percentage, such as 15% or 56%.
  • Stem and Leaf Plot – In the stem and leaf plot, the data are organised from the least value to the greatest value. The digits of the least place value form the leaves, and the next place value digits form the stems.
  • Box and Whisker Plot – The plot summarises the data by dividing it into four parts. The box and whiskers show the range (spread) and the middle (median) of the data.
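The four parts of a box and whisker plot are just the quartiles of the data. A minimal sketch in Python, using the standard library's `statistics` module (the dataset is invented for illustration):

```python
import statistics

# Hypothetical dataset, for illustration only
data = [1, 2, 3, 4, 5, 6, 7, 8]

# Quartiles divide the sorted data into four equal parts;
# the box spans Q1..Q3 and the whiskers span min..max.
q1, q2, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)  # (1, 2.25, 4.5, 6.75, 8)
```

The five-number summary (minimum, Q1, median, Q3, maximum) is exactly what the box and whiskers display.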


General Rules for Graphical Representation of Data

There are certain rules to effectively present the information in the graphical representation. They are:

  • Suitable Title: Give the graph an appropriate title that indicates the subject of the presentation.
  • Measurement Unit: Mention the measurement unit in the graph.
  • Proper Scale: To represent the data accurately, choose a proper scale.
  • Index: Provide an index (legend) of the colours, shades, lines and designs used in the graph for better understanding.
  • Data Sources: Include the source of the information at the bottom of the graph wherever necessary.
  • Keep it Simple: Construct the graph simply, so that everyone can understand it.
  • Neat: Choose the correct size, fonts, colours, etc. so that the graph is an effective visual aid for the presentation of information.

Graphical Representation in Maths

In mathematics, a graph is defined as a chart with statistical data represented in the form of curves or lines drawn across the coordinate points plotted on its surface. It helps to study the relationship between two variables by measuring the change in one variable with respect to another within a given interval of time. It also helps to study series distributions and frequency distributions for a given problem. There are two types of graphs to visually depict the information. They are:

  • Time Series Graphs – Example: Line Graph
  • Frequency Distribution Graphs – Example: Frequency Polygon Graph

Principles of Graphical Representation

Algebraic principles apply to all types of graphical representation of data. In graphs, data is represented using two lines called coordinate axes. The horizontal axis is the x-axis and the vertical axis is the y-axis. The point at which the two lines intersect is called the origin ‘O’. On the x-axis, distances to the right of the origin take positive values and distances to the left take negative values. Similarly, on the y-axis, points above the origin take positive values, and points below the origin take negative values.


Generally, the frequency distribution is represented in four methods, namely

  • Smoothed frequency graph
  • Pie diagram
  • Cumulative or ogive frequency graph
  • Frequency Polygon

Merits of Using Graphs

Some of the merits of using graphs are as follows:

  • A graph is easily understood by everyone without any prior knowledge.
  • It saves time.
  • It allows us to relate and compare data for different time periods.
  • It is used in statistics to determine the mean, median and mode for different data, as well as in the interpolation and extrapolation of data.

Example of a Frequency Polygon Graph

Here are the steps to construct a frequency polygon and represent a frequency distribution graphically:

  • Obtain the frequency distribution and find the midpoints of each class interval.
  • Represent the midpoints along x-axis and frequencies along the y-axis.
  • Plot the points corresponding to the frequency at each midpoint.
  • Join these points, using lines in order.
  • To complete the polygon, join the points at each end to the class marks with zero frequency immediately below the lowest and above the highest class on the x-axis.

Draw the frequency polygon for the following data:

Class interval   10-20   20-30   30-40   40-50   50-60   60-70   70-80   80-90
Frequency        4       6       8       10      12      14      7       5

Mark the class interval along x-axis and frequencies along the y-axis.

Assume a class interval 0-10 with frequency zero and a class interval 90-100 with frequency zero.

Now calculate the midpoint of each class interval.

Class interval   Midpoint   Frequency
0-10             5          0
10-20            15         4
20-30            25         6
30-40            35         8
40-50            45         10
50-60            55         12
60-70            65         14
70-80            75         7
80-90            85         5
90-100           95         0

Using the midpoint and the frequency value from the above table, plot the points A (5, 0), B (15, 4), C (25, 6), D (35, 8), E (45, 10), F (55, 12), G (65, 14), H (75, 7), I (85, 5) and J (95, 0).

To obtain the frequency polygon ABCDEFGHIJ, draw the line segments AB, BC, CD, DE, EF, FG, GH, HI, IJ, and connect all the points.
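The worked example can be reproduced in a few lines of Python. The intervals and frequencies come straight from the table above; the matplotlib step is optional and assumes the library is installed:

```python
# Class intervals and frequencies from the worked example above,
# including the two zero-frequency end classes 0-10 and 90-100.
intervals = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
             (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
frequencies = [0, 4, 6, 8, 10, 12, 14, 7, 5, 0]

# The midpoint of each class interval gives the x-coordinate of each vertex
midpoints = [(lo + hi) / 2 for lo, hi in intervals]
points = list(zip(midpoints, frequencies))
print(points[:3])  # [(5.0, 0), (15.0, 4), (25.0, 6)]

# Optional: join the vertices in order to draw the polygon
try:
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend
    import matplotlib.pyplot as plt
    plt.plot(midpoints, frequencies, marker="o")
    plt.xlabel("Class midpoint")
    plt.ylabel("Frequency")
    plt.savefig("frequency_polygon.png")
except ImportError:
    pass  # plotting is optional; the vertices above define the polygon
```

The printed points are the vertices A(5, 0) through J(95, 0) listed in the text; joining them in order produces the frequency polygon.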


Frequently Asked Questions

What are the different types of graphical representation?

Some of the various types of graphical representation include:

  • Line Graphs
  • Frequency Table
  • Circle Graph, etc.


What are the Advantages of Graphical Method?

Some of the advantages of graphical representation are:

  • It makes data more easily understandable.
  • It saves time.
  • It makes the comparison of data more efficient.

Introduction to Graphs


15 December 2020                 

Read time: 6 minutes

Introduction

What are graphs?

What are the different types of data?

What are the different types of graphical representations?

The graph is nothing but an organized representation of data. It helps us to understand the data. Data are the numerical information collected through observation.

The word data comes from the Latin word datum, which means “something given”.

After a research question is developed, data is collected through observation. It is then organized, summarized, classified, and represented graphically.

Differences between data and information: data is the raw fact without any add-on, while information is the meaning derived from data.

Data                    Information
Raw facts of things     Data with exact meaning
No contextual meaning   Processed data and organized context
Just numbers and text


What are the different Types of Data?

There are two types of data: quantitative and qualitative.

Quantitative

Data which is statistical or numerical is known as quantitative data. Quantitative data is generated through experiments, tests, surveys, and market reports. It is also known as structured data.

Quantitative data is further divided into continuous data and discrete data.

Continuous Data

Continuous data is data that can take any value. Because continuous data can give infinite outcomes, it should be grouped before being represented on a graph. Examples:

  • The speed of a vehicle as it passes a checkpoint
  • The mass of a cooking apple
  • The time taken by a volunteer to perform a task

Discrete Data

Discrete data can take only certain values; only a finite set of distinct values can occur. Examples:

  • Number of cars sold at a dealership during a given month
  • Number of houses in a certain block
  • Number of fish caught on a fishing trip
  • Number of complaints received at the office of an airline on a given day
  • Number of customers who visit a bank during any given hour
  • Number of heads obtained in three tosses of a coin

Differences between Discrete and Continuous data

  • Numerical data could be either discrete or continuous
  • Continuous data can take any numerical value (within a range); For example, weight, height, etc.
  • There can be an infinite number of possible values in continuous data
  • Discrete data can take only certain values by finite ‘jumps’, i.e., it ‘jumps’ from one value to another but does not take any intermediate value between them (For example, number of students in the class)
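The contrast above can be made concrete: discrete values can be tallied directly, while continuous values must first be grouped into intervals. A small sketch (the measurements are invented for illustration):

```python
from collections import Counter

# Discrete data: tally each value directly
heads_per_trial = [0, 2, 1, 3, 2, 2, 1, 0, 3, 1]  # heads in three coin tosses
discrete_counts = Counter(heads_per_trial)
print(discrete_counts[2])  # 3

# Continuous data: infinitely many possible values, so group into
# equal intervals (here, width-10 bins) before counting
speeds = [48.2, 53.7, 61.0, 45.9, 58.3, 62.4, 49.5]  # hypothetical km/h readings
bin_width = 10
binned = Counter(int(s // bin_width) * bin_width for s in speeds)
print(binned)  # counts per interval start; binned[40] is the 40-50 bin
```

This is exactly why a histogram (grouped intervals) suits continuous data while a bar graph suits discrete or categorical data.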

Qualitative

Data that deals with description or quality instead of numbers is known as qualitative data. Qualitative data is also known as unstructured data, because this type of data is loosely structured and can’t be analyzed conventionally.

Different Types of Graphical Representations

There are many types of graphs we can use to represent data. They are as follows.

A bar graph or chart is a way to represent data by rectangular columns or bars. The height or length of each bar is proportional to the value it represents.

A bar graph or chart

A line graph is a type of graph where the information or data is plotted as dots, known as markers, which are then joined to each other by straight lines.

The line graph is normally used to represent the data that changes over time.

A line graph

A histogram is a graph where the information is represented by the heights of rectangular bars. Though it looks like a bar graph, there is a fundamental difference between them: each column of a histogram represents a range of quantitative data, whereas a bar graph represents categorical variables.

Histogram and pie chart

The other name for the pie chart is circle graph. It is a circular chart where numerical information is represented as slices, in fractional form or as percentages, with the whole circle being 100%.

Pie chart

  • Stem and leaf plot

The stem and leaf plot is a way to represent quantitative data according to frequency ranges or frequency distribution.

In a stem and leaf plot, each data value is split into a stem and a leaf; for example, 32 is split into a stem of 3 and a leaf of 2.

Stem and leaf plot
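The stem/leaf split described above (32 → stem 3, leaf 2) is easy to sketch in Python; the data values here are invented for illustration:

```python
from collections import defaultdict

# Hypothetical two-digit data values
data = [32, 35, 41, 47, 47, 52, 33, 48]

# Split each value: the tens digit is the stem, the units digit is the leaf
plot = defaultdict(list)
for value in sorted(data):          # least value to greatest value
    stem, leaf = divmod(value, 10)
    plot[stem].append(leaf)

for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
# 3 | 2 3 5
# 4 | 1 7 7 8
# 5 | 2
```

Each printed row is one stem with its sorted leaves, which is the usual layout of a stem and leaf plot.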

Frequency table: Frequency means the number of occurrences of an event. A frequency distribution table is a chart which shows the frequency of each event. Frequency is denoted as ‘f’.

Frequency table

A pictograph or pictogram is the earliest way to represent data pictorially, using symbols or images. Each image represents a particular number of things.

Pictograph or Pictogram

According to the above-mentioned pictograph, the number of apples sold on Monday is 6 × 2 = 12.

  • Scatter diagrams

A scatter diagram or scatter plot is a way of representing data graphically using Cartesian coordinates of two variables. The plot shows the relationship between the two variables. Below is a data table as well as a scatter plot of the given data.

Temperature (°C)   Amount ($)
14.2               215
16.4               325
11.9               185
15.2               332
18.5               406
22.1               522
19.4               412
25.1               614
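Using the table above, the relationship the scatter plot reveals can be quantified with Pearson's correlation coefficient, computed here in plain Python (the plotting step is optional and assumes matplotlib is installed):

```python
import math

# Data from the table above: temperature (°C) vs. dollar amount
temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1]
amounts = [215, 325, 185, 332, 406, 522, 412, 614]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(temps, amounts)
print(round(r, 2))  # strongly positive, close to +1

# Optional scatter plot
try:
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    plt.scatter(temps, amounts)
    plt.xlabel("Temperature (°C)")
    plt.ylabel("Amount ($)")
    plt.savefig("scatter.png")
except ImportError:
    pass
```

A correlation close to +1 matches what the scatter plot shows visually: the points rise together almost in a line.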

What is the meaning of Graphical representation?

Graphical representation is a way to represent and analyze quantitative data. A graph is a kind of chart in which data are plotted as variables across coordinates. It becomes easy to analyze the extent of change in one variable given a change in another variable.

Principles of graphical representation

The principles of graphical representation are algebraic. In a graph, there are two lines known as axes or coordinate axes: the X-axis and the Y-axis. The horizontal axis is the X-axis and the vertical axis is the Y-axis. They are perpendicular to each other and intersect at O, the point of origin.

On the right side of the origin, the X-axis has positive values and on the left side negative values. In the same way, the Y-axis has positive values above the origin and negative values below it.

When the X-axis and Y-axis intersect at the origin, they divide the plane into four parts, called Quadrant I, Quadrant II, Quadrant III, and Quadrant IV.

Principles of graphical representation

A location on the coordinate plane is known as an ordered pair, written as (x, y): the first value lies along the x-axis and the second along the y-axis. To plot a coordinate, start counting from the origin and move along the x-axis, to the right if the value is positive and to the left if it is negative. Then from the x-axis, plot the y value, moving up for a positive value or down for a negative value, parallel to the y-axis.

In the following graph, the 1st ordered pair (2, 3) has positive x and y values and lies in Quadrant I. The 2nd ordered pair (-3, 1) has a negative x value and a positive y value, and lies in Quadrant II. The 3rd ordered pair (-1.5, -2.5) has negative x and y values, and lies in Quadrant III.

Principles of graphical representation
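The sign rules above translate directly into code; a minimal sketch:

```python
def quadrant(x, y):
    """Return the quadrant (I-IV) of a point, or 'axis' if it lies on an axis."""
    if x > 0 and y > 0:
        return "I"
    if x < 0 and y > 0:
        return "II"
    if x < 0 and y < 0:
        return "III"
    if x > 0 and y < 0:
        return "IV"
    return "axis"  # points with x == 0 or y == 0 lie on an axis, not in a quadrant

# The three ordered pairs from the text
print(quadrant(2, 3), quadrant(-3, 1), quadrant(-1.5, -2.5))  # I II III
```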

Methods of representing a frequency distribution

There are four methods to represent a frequency distribution graphically. These are:

  • Smoothed frequency graph
  • Cumulative frequency graph or ogive
  • Frequency polygon
  • Pie diagram

Advantages and Disadvantages of Graphical representation of data

  • It improves the way of analyzing and learning, as graphical representation makes the data easy to understand.
  • It can be used in almost all fields, from mathematics to physics to psychology and so on.
  • It is easy to understand because of its visual impact.
  • It shows the whole of a large dataset at a glance.

The main disadvantage of graphical representation of data is that it takes a lot of effort as well as resources to find the most appropriate data and then represent it graphically.

You may also like:

  • Graphing a Quadratic Function
  • Empirical Relationship Between Mean, Median, and Mode

Not only in mathematics but in almost every field, the graph is a very important way to store, analyze, and represent information. After any research work or survey, the next step is to organize the observations and plot them on graph paper or a plane. The visual representation of information makes it easier to understand crucial components or trends.

A huge amount of data can be stored or analyzed in a small space.

The graphical representation of data helps in decision-making by following the trend.

A complete idea: graphical representation constitutes a clear and comprehensive idea in the minds of the audience. Reading a large number (say hundreds) of pages may not help to make a decision; anyone can get a clear idea just by looking at the graph.

Graphs are a very conceptual topic, so it is essential to get a complete understanding of the concept. Graphs are great visual aids and help explain numerous things better, they are important in everyday life. Get better at graphs with us, sign up for a free trial . 

About Cuemath

Cuemath, a student-friendly mathematics and coding platform, conducts regular Online Classes for academics and skill-development, and their Mental Math App, on both iOS and Android , is a one-stop solution for kids to develop multiple skills. Understand the Cuemath Fee structure and sign up for a free trial.

Frequently Asked Questions (FAQs)

What is data?

Data are characteristics or information, usually numerical, that are collected through observation.

How do you differentiate between data and information?

Data is the raw fact without any add on but the information is the meaning derived from data.

What are the types of data?

There are two types of data: quantitative and qualitative.

What are the ways to represent data?

Tables, charts and graphs are all ways of representing data, and they can be used for two broad purposes. The first is to support the collection, organisation and analysis of data as part of the process of a scientific study.

What are the different types of graphs?

Different types of graphs include line graphs, bar graphs, histograms, pie charts, stem and leaf plots, and scatter plots.


Statistics LibreTexts

2.1: Introduction



Learning Objectives

By the end of this chapter, the student should be able to:

  • Display data graphically and interpret graphs: stemplots, bar charts, frequency polygons, histograms, etc.

Once you have collected data, what will you do with it? Data can be described and presented in many different formats. For example, suppose you want to find a change in temperature in a particular city over time. Looking at all the raw data can be confusing and overwhelming. A better way to look at that data would be to create a graph that displays the data in a visual manner. Then patterns can more easily be discerned.


In this chapter, you will study graphical ways to describe and display your data. You will learn to create, and more importantly, interpret a variety of graph types, and you will learn when to use each type of graph.

A statistical graph is a tool that helps you learn about the shape or distribution of a sample or a population. A graph can be a more effective way of presenting data than a mass of numbers because we can see where data clusters and where there are only a few data values. Newspapers and the Internet use graphs to show trends and to enable readers to compare facts and figures quickly. Statisticians often graph data first to get a picture of the data. Then, more formal tools may be applied.

Some of the types of graphs that are used to summarize and organize data are the dot plot, the bar graph, the histogram, the stem-and-leaf plot, the frequency polygon (a type of broken line graph), the pie chart, and the box plot. In this chapter, we will briefly look at stem-and-leaf plots, line graphs, and bar graphs, as well as frequency polygons, and time series graphs.

This book contains instructions for constructing some graph types using Excel.

Contributors and Attributions

Barbara Illowsky and Susan Dean (De Anza College) with many other contributing authors. Content produced by OpenStax College is licensed under a Creative Commons Attribution License 4.0 license. Download for free at http://cnx.org/contents/[email protected] .


Data Visualization: Definition, Benefits, and Examples

Data visualization helps data professionals tell a story with data. Here’s a definitive guide to data visualization.


Data visualization is a powerful way for people, especially data professionals, to display data so that it can be interpreted easily. It helps tell a story with data, by turning spreadsheets of numbers into stunning graphs and charts.

In this article, you’ll learn all about data visualization, including its definition, benefits, examples, types, and tools. If you decide you want to learn the skills to incorporate it into your job, we'll point you toward online courses you can do from anywhere.

What is data visualization?

Data visualization is the representation of information and data using charts, graphs, maps, and other visual tools. These visualizations allow us to easily understand any patterns, trends, or outliers in a data set.

Data visualization also presents data to the general public or specific audiences without technical knowledge in an accessible manner. For example, the health agency in a government might provide a map of vaccinated regions.

The purpose of data visualization is to help drive informed decision-making and to add colorful meaning to an otherwise bland database.

Benefits of data visualization

Data visualization can be used in many contexts in nearly every field, like public policy, finance, marketing, retail, education, sports, history, and more. Here are the benefits of data visualization:

Storytelling: People are drawn to colors and patterns in clothing, arts and culture, architecture, and more. Data is no different—colors and patterns allow us to visualize the story within the data.

Accessibility: Information is shared in an accessible, easy-to-understand manner for a variety of audiences.

Visualize relationships: It’s easier to spot the relationships and patterns within a data set when the information is presented in a graph or chart.

Exploration: More accessible data means more opportunities to explore, collaborate, and inform actionable decisions.

Data visualization and big data

Companies collect “big data” and synthesize it into information. Data visualization helps portray significant insights, like a heat map illustrating regions where individuals search for mental health assistance. To synthesize all that data, visualization software can be used in conjunction with data collection software.

Tools for visualizing data

There are plenty of data visualization tools out there to suit your needs. Before committing to one, consider researching whether you need an open-source site or could simply create a graph using Excel or Google Charts. The following are common data visualization tools that could suit your needs. 

Google Charts

ChartBlocks

FusionCharts

Get started with a free tool

No matter the field, using visual representations to illustrate data can be immensely powerful. Tableau has a free public tool that anyone can use to create stunning visualizations for a school project, non-profit, or small business. 

Types of data visualization

Visualizing data can be as simple as a bar graph or scatter plot, but it becomes powerful when analyzing, for example, the median age of the United States Congress vis-à-vis the median age of Americans. Here are some common types of data visualizations:

Table: A table is data displayed in rows and columns, which can be easily created in a Word document or Excel spreadsheet.

Chart or graph: Information is presented with data displayed along an x and y axis, usually with bars, points, or lines, to represent data in comparison. An infographic is a special type of chart that combines visuals and words to illustrate the data.

Gantt chart: A Gantt chart is a bar chart that portrays a timeline and tasks specifically used in project management.

Pie chart: A pie chart divides data into percentages featured in “slices” of a pie, all adding up to 100%. 

Geospatial visualization: Data is depicted in map form with shapes and colors that illustrate the relationship between specific locations, such as a choropleth or heat map.

Dashboard: Data and visualizations are displayed, usually for business purposes, to help analysts understand and present data.
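The pie chart entry above (slices adding up to 100%) amounts to converting each value into a share of the 360° circle. A quick sketch with hypothetical category totals:

```python
# Hypothetical category totals, for illustration only
sales = {"North": 150, "South": 250, "East": 100, "West": 500}
total = sum(sales.values())

# Each slice's percentage of the whole and its angle out of 360 degrees
slices = {k: (v / total * 100, v / total * 360) for k, v in sales.items()}
for region, (pct, angle) in slices.items():
    print(f"{region}: {pct:.1f}% -> {angle:.1f} degrees")

# The percentages always total 100 and the angles always total 360
print(round(sum(p for p, _ in slices.values()), 10))  # 100.0
```

Any charting tool performs this same conversion internally when it draws the pie.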

Data visualization examples

Using data visualization tools, different types of charts and graphs can be created to illustrate important data. These are a few examples of data visualization in the real world:

Data science: Data scientists and researchers have access to libraries using programming languages or tools such as Python or R, which they use to understand and identify patterns in data sets. Tools help these data professionals work more efficiently by coding research with colors, plots, lines, and shapes.

Marketing: Tracking data such as web traffic and social media analytics can help marketers analyze how customers find their products and whether they are early adopters or more of a laggard buyer. Charts and graphs can synthesize data for marketers and stakeholders to better understand these trends. 

Finance: Investors and advisors focused on buying and selling stocks, bonds, dividends, and other commodities will analyze the movement of prices over time to determine which are worth purchasing for short- or long-term periods. Line graphs help financial analysts visualize this data, toggling between months, years, and even decades.

Health policy: Policymakers can use choropleth maps, in which geographical areas (nations, states, continents) are divided by color. They can, for example, use these maps to demonstrate the mortality rates of cancer or Ebola in different parts of the world.

Tackle big business decisions by backing them up with data analytics. Google's Data Analytics Professional Certificate can boost your skills:

Jobs that use data visualization

From marketing to data analytics, data visualization is a skill that can be beneficial to many industries. Building your skills in data visualization can help in the following jobs:

Data visualization analyst: As a data visualization analyst (or specialist), you’d be responsible for creating and editing visual content such as maps, charts, and infographics from large data sets. 

Data visualization engineer: Data visualization engineers and developers are experts in both maneuvering data with SQL, as well as assisting product teams in creating user-friendly dashboards that enable storytelling.

Data analyst: A data analyst collects, cleans, and interprets data sets to answer questions or solve business problems.

Data is everywhere. In creative roles such as graphic designer , content strategist, or social media specialist, data visualization expertise can help you solve challenging problems. You could create dashboards to track analytics as an email marketer or make infographics as a communications designer.

On the flip side, data professionals can benefit from data visualization skills to tell more impactful stories through data.

Read more: 5 Data Visualization Jobs (+ Ways to Build Your Skills Now)

Dive into data visualization

Learn the basics of data visualization with the University of California Davis’ Data Visualization with Tableau Specialization . You’ll leverage Tableau’s library of resources to learn best practices for data visualization and storytelling, learning from real-world and journalistic examples. Tableau is one of the most respected and accessible data visualization tools. 

To learn more about data visualization using Excel and Cognos Analytics, take a look at IBM’s Data Analysis and Visualization Foundations Specialization .


The Sheridan Libraries


What is Data Visualization?


"There is a magic in graphs."

"The profile of a curve reveals in a flash a whole situation — the life history of an epidemic, a panic, or an era of prosperity. The curve informs the mind, awakens the imagination, convinces."

- Henry D. Hubbard, National Bureau of Standards

Data visualization is the graphical representation of data for understanding and communication. This encompasses two primary classes of visualization:

Information visualization – the visualization of data. This can either be:

  • Exploratory: you are trying to explore and understand patterns and trends within your data.
  • Explanatory: there is something in your data you would like to communicate to your audience.

Scientific visualization – the visualization of data with an inherent spatial component. This can be the visualization of scalar, vector, and tensor fields. Common areas of scientific visualization include computational fluid dynamics, medical imaging and analysis, and weather data analysis.

Good data visualizations allow us to reason and think effectively about our data. By presenting information visually, we offload internal cognition to the perceptual system. If we see numerical data in a table, we may be able to find a trend, but it will take a significant amount of work on our part to recognize and conceptualize that trend. By plotting that data visually, the trend becomes immediately clear to our mind through our perceptual system.

A good example of this is "Anscombe's quartet", four datasets that share the same descriptive statistics, including mean, variance, and correlation.

Anscombe's Quartet Table

Upon visual inspection, it becomes immediately clear that these datasets, while seemingly identical according to common summary statistics, are each unique. This is the power of effective data visualization: it allows us to bypass cognition by communicating directly with our perceptual system.

Anscombe's Quartet plot published under the terms of "Creative Commons Attribution-Share Alike", source: http://commons.wikimedia.org/wiki/File:Anscombe%27s_quartet_3.svg
Anscombe's Quartet table source: https://multithreaded.stitchfix.com/assets/images/blog/anscombes_quartet_table.png
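The quartet's shared statistics are easy to verify directly. Below is a minimal self-contained Python sketch (the chapter's own examples use R; the data values here are the standard published Anscombe sets):

```python
# Anscombe's quartet: four datasets with nearly identical summary statistics.
from statistics import mean, pvariance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared x for sets I-III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV":  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

def pearson(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Mean, population variance, and correlation for each dataset, rounded.
stats = {}
for name, y in ys.items():
    x = x4 if name == "IV" else x123
    stats[name] = (round(mean(y), 2), round(pvariance(y), 2), round(pearson(x, y), 2))
```

At two decimal places all four sets agree on mean, variance, and correlation, yet their scatterplots differ dramatically.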

These materials are licensed under a Creative Commons license, attributable to Johns Hopkins University.
  • Next: Planning a Data Visualization >>
  • Last Updated: Mar 7, 2024 1:06 PM
  • URL: https://guides.library.jhu.edu/datavisualization


Praxis Core Math

Course: Praxis Core Math > Unit 1 > Data representations | Lesson

  • Data representations | Worked example
  • Center and spread | Lesson
  • Center and spread | Worked example
  • Random sampling | Lesson
  • Random sampling | Worked example
  • Scatterplots | Lesson
  • Scatterplots | Worked example
  • Interpreting linear models | Lesson
  • Interpreting linear models | Worked example
  • Correlation and Causation | Lesson
  • Correlation and causation | Worked example
  • Probability | Lesson
  • Probability | Worked example


What are data representations?

Data representations are graphics that display and summarize data. They help us answer questions such as:

  • How much of the data falls within a specified category or range of values?
  • What is a typical value of the data?
  • How much spread is in the data?
  • Is there a trend in the data over time?
  • Is there a relationship between two variables?

What skills are tested?

  • Matching a data set to its graphical representation
  • Matching a graphical representation to a description
  • Using data representations to solve problems

How are qualitative data displayed?

Language | Number of Students
Spanish
French
Mandarin
Latin
  • A vertical bar chart lists the categories of the qualitative variable along a horizontal axis and uses the heights of the bars on the vertical axis to show the values of the quantitative variable. A horizontal bar chart lists the categories along the vertical axis and uses the lengths of the bars on the horizontal axis to show the values of the quantitative variable. This display draws attention to how the categories rank according to the amount of data within each. Example: The heights of the bars show the number of students who want to study each language. Using the bar chart, we can conclude that the greatest number of students want to study Mandarin and the least number of students want to study Latin.
  • A pictograph is like a horizontal bar chart but uses pictures instead of the lengths of bars to represent the values of the quantitative variable. Each picture represents a certain quantity, and each category can have multiple pictures. Pictographs are visually interesting, but require us to use the legend to convert the number of pictures to quantitative values. Example: Each picture represents 40 students. The number of pictures shows the number of students who want to study each language. Using the pictograph, we can conclude that twice as many students want to study French as want to study Latin.
  • A circle graph (or pie chart) is a circle that is divided into as many sections as there are categories of the qualitative variable. The area of each section represents, for each category, the value of the quantitative data as a fraction of the sum of values. The fractions sum to 1. Sometimes the section labels include both the category and the associated value or percent value for that category. Example: The area of each section represents the fraction of students who want to study that language. Using the circle graph, we can conclude that just under 1/2 of the students want to study Mandarin and about 1/3 want to study Spanish.
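The counts in the table above were not preserved, so the sketch below uses hypothetical enrollment counts, chosen only to be consistent with the conclusions drawn in the examples (Mandarin greatest, Latin least, French twice Latin):

```python
# Hypothetical enrollment counts (the original table values are not preserved).
counts = {"Spanish": 100, "French": 60, "Mandarin": 150, "Latin": 30}
total = sum(counts.values())                      # 340 students

# Circle graph: each category's section is its fraction of the whole.
fractions = {lang: n / total for lang, n in counts.items()}

# Pictograph: each picture represents 40 students, as in the example above,
# so a category can contribute partial pictures.
pictures = {lang: n / 40 for lang, n in counts.items()}
```

The fractions necessarily sum to 1, which is exactly the constraint the circle graph encodes geometrically.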

How are quantitative data displayed?

  • Dotplots use one dot for each data point. The dots are plotted above their corresponding values on a number line. The number of dots above each specific value represents the count of that value. Dotplots show the value of each data point and are practical for small data sets. Example: Each dot represents the typical travel time to school for one student. Using the dotplot, we can conclude that the most common travel time is 10 minutes. We can also see that the values for travel time range from 5 to 35 minutes.
  • Histograms divide the horizontal axis into equal-sized intervals and use the heights of the bars to show the count or percent of data within each interval. By convention, each interval includes the lower boundary but not the upper one. Histograms show only totals for the intervals, not specific data points. Example: The height of each bar represents the number of students having a typical travel time within the corresponding interval. Using the histogram, we can conclude that the most common travel time is between 10 and 15 minutes and that all typical travel times are between 5 and 40 minutes.
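The boundary convention matters when a value lands exactly on an interval edge. A minimal pure-Python sketch of lower-inclusive, upper-exclusive binning (the travel times are hypothetical):

```python
# Count values into equal-width intervals [lower, upper): a boundary value
# such as 10 belongs to the interval that starts at 10, not the one ending there.
def histogram_counts(values, start, stop, width):
    edges = list(range(start, stop, width))
    counts = {(lo, lo + width): 0 for lo in edges}
    for v in values:
        for lo in edges:
            if lo <= v < lo + width:      # includes lower bound, excludes upper
                counts[(lo, lo + width)] += 1
                break
    return counts

travel_times = [5, 8, 10, 10, 12, 14, 15, 22, 35]   # hypothetical minutes
counts = histogram_counts(travel_times, 5, 40, 5)
```

Note that both 10s fall in the [10, 15) bar, which is how standard histogram tools handle boundary values by default.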

How are trends over time displayed?

How are relationships between variables displayed?


Things to remember

  • When matching data to a representation, check that the values are graphed accurately for all categories.
  • When reporting data counts or fractions, be clear whether a question asks about data within a single category or a comparison between categories.
  • When finding the number or fraction of the data meeting a criterion, watch for key words such as "or," "and," "less than," and "more than."



Guide On Graphical Representation of Data – Types, Importance, Rules, Principles And Advantages


What are Graphs and Graphical Representation?

Graphs, in the context of data visualization, are visual representations of data using graphical elements such as charts, plots, and diagrams. Graphical representation of data, often referred to as graphical presentation or simply graphing, plays a crucial role in conveying information effectively.

Principles of Graphical Representation

Effective graphical representation follows certain fundamental principles that ensure clarity, accuracy, and usability:

  • Clarity: The primary goal of any graph is to convey information clearly and concisely. Graphs should be designed in a way that allows the audience to quickly grasp the key points without confusion.

  • Simplicity: Simplicity is key to effective data visualization. Extraneous details and unnecessary complexity should be avoided to prevent confusion and distraction.
  • Relevance: Include only relevant information that contributes to the understanding of the data. Irrelevant or redundant elements can clutter the graph.
  • Appropriate Visualization: Select a graph type that suits the data. Different graph formats, such as bar charts, line graphs, and scatter plots, fit different sorts of data and relationships.

Rules for Graphical Representation of Data

Creating effective graphical representations of data requires adherence to certain rules:

  • Select the Right Graph: Choosing the appropriate type of graph is essential. For example, bar charts are suitable for comparing categories, while line charts are better for showing trends over time.
  • Label Axes Clearly: Axis labels should be descriptive and include units of measurement where applicable. Clear labeling ensures the audience understands the data’s context.
  • Use Appropriate Colors: Colors can enhance understanding but should be used judiciously. Avoid overly complex color schemes and ensure that color choices are accessible to all viewers.
  • Avoid Misleading Scaling: Scale axes appropriately to prevent exaggeration or distortion of data. Misleading scaling can lead to incorrect interpretations.
  • Include Data Sources: Always provide the source of your data. This enhances transparency and credibility.

Importance of Graphical Representation of Data

Graphical representation of data in statistics is of paramount importance for several reasons:

  • Enhances Understanding: Graphs simplify complex data, making it more accessible and understandable to a broad audience, regardless of their statistical expertise.
  • Helps Decision-Making: Visual representations of data enable informed decision-making. Decision-makers can easily grasp trends and insights, leading to better choices.
  • Engages the Audience: Graphs capture the audience’s attention more effectively than raw data. This engagement is particularly valuable when presenting findings or reports.
  • Universal Language: Graphs serve as a universal language that transcends linguistic barriers. They can convey information to a global audience without the need for translation.

Advantages of Graphical Representation

The advantages of graphical representation of data extend to various aspects of communication and analysis:

  • Clarity: Data is presented visually, improving clarity and reducing the likelihood of misinterpretation.
  • Efficiency: Graphs enable the quick absorption of information. Key insights can be found in seconds, saving time and effort.
  • Memorability: Visuals are more memorable than raw data. Audiences are more likely to retain information presented graphically.
  • Problem-Solving: Graphs help in identifying and solving problems by revealing trends, correlations, and outliers that may require further investigation.

Use of Graphical Representations

Graphical representations find applications in a multitude of fields:

  • Business: In the business world, graphs are used to illustrate financial data, track performance metrics, and present market trends. They are invaluable tools for strategic decision-making.
  • Science: Scientists employ graphs to visualize experimental results, depict scientific phenomena, and communicate research findings to both colleagues and the general public.
  • Education: Educators utilize graphs to teach students about data analysis, statistics, and scientific concepts. Graphs make learning more engaging and memorable.
  • Journalism: Journalists rely on graphs to support their stories with data-driven evidence. Graphs make news articles more informative and impactful.

Types of Graphical Representation

There exists a diverse array of graphical representations, each suited to different data types and purposes. Common types include:

1. Bar Charts:

Used to compare categories or discrete data points, often side by side.


2. Line Charts:

Ideal for showing trends and changes over time, such as stock market performance or temperature fluctuations.


3. Pie Charts:

Display parts of a whole, useful for illustrating proportions or percentages.


4. Scatter Plots:

Reveal relationships between two variables and help identify correlations.


5. Histograms:

Depict the distribution of data, especially in the context of continuous variables.


In conclusion, the graphical representation of data is an indispensable tool for simplifying complex information, aiding in decision-making, and enhancing communication across diverse fields. By following the principles and rules of effective data visualization, individuals and organizations can harness the power of graphs to convey their messages, support their arguments, and drive informed actions.


FAQs on Graphical Representation of Data

What is the purpose of graphical representation?

Graphical representation serves the purpose of simplifying complex data, making it more accessible and understandable through visual means.

Why are graphs and diagrams important?

Graphs and diagrams are crucial because they provide visual clarity, aiding in the comprehension and retention of information.

How do graphs help learning?

Graphs engage learners by presenting information visually, which enhances understanding and retention, particularly in educational settings.

Who uses graphs?

Professionals in various fields, including scientists, analysts, educators, and business leaders, use graphs to convey data effectively and support decision-making.

Where are graphs used in real life?

Graphs are used in real-life scenarios such as business reports, scientific research, news articles, and educational materials to make data more accessible and meaningful.

Why are graphs important in business?

In business, graphs are vital for analyzing financial data, tracking performance metrics, and making informed decisions, contributing to success.


Introduction to Data Science

Chapter 11: Data Visualization Principles

We have already provided some rules to follow as we created plots for our examples. Here, we aim to provide some general principles we can use as a guide for effective data visualization. Much of this section is based on a talk by Karl Broman 30 titled “Creating Effective Figures and Tables” 31 and includes some of the figures which were made with code that Karl makes available on his GitHub repository 32 , as well as class notes from Peter Aldhous’ Introduction to Data Visualization course 33 . Following Karl’s approach, we show some examples of plot styles we should avoid, explain how to improve them, and use these as motivation for a list of principles. We compare and contrast plots that follow these principles to those that don’t.

The principles are mostly based on research related to how humans detect patterns and make visual comparisons. The preferred approaches are those that best fit the way our brains process visual information. When deciding on a visualization approach, it is also important to keep our goal in mind. We may be comparing a viewable number of quantities, describing distributions for categories or numeric values, comparing the data from two groups, or describing the relationship between two variables. As a final note, we want to emphasize that for a data scientist it is important to adapt and optimize graphs to the audience. For example, an exploratory plot made for ourselves will be different than a chart intended to communicate a finding to a general audience.

We will be using ggplot2 and its companion tidyverse libraries.

11.1 Encoding data using visual cues

We start by describing some principles for encoding data. There are several approaches at our disposal including position, aligned lengths, angles, area, brightness, and color hue.

To illustrate how some of these strategies compare, let’s suppose we want to report the results from two hypothetical polls regarding browser preference taken in 2000 and then 2015. For each year, we are simply comparing five quantities – the five percentages. A widely used graphical representation of percentages, popularized by Microsoft Excel, is the pie chart:

Here we are representing quantities with both areas and angles, since both the angle and area of each pie slice are proportional to the quantity the slice represents. This turns out to be a sub-optimal choice since, as demonstrated by perception studies, humans are not good at precisely quantifying angles and are even worse when area is the only available visual cue. The donut chart is an example of a plot that uses only area:

To see how hard it is to quantify angles and area, note that the rankings and all the percentages in the plots above changed from 2000 to 2015. Can you determine the actual percentages and rank the browsers’ popularity? Can you see how the percentages changed from 2000 to 2015? It is not easy to tell from the plot. In fact, the pie R function help file states that:

Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.

In this case, simply showing the numbers is not only clearer, but would also save on printing costs if printing a paper copy:

Browser 2000 2015
Opera 3 2
Safari 21 22
Firefox 23 21
Chrome 26 29
IE 28 27

The preferred way to plot these quantities is to use length and position as visual cues, since humans are much better at judging linear measures. The barplot uses this approach by using bars of length proportional to the quantities of interest. By adding horizontal lines at strategically chosen values, in this case at every multiple of 10, we ease the visual burden of quantifying through the position of the top of the bars. Compare and contrast the information we can extract from the two figures.

Notice how much easier it is to see the differences in the barplot. In fact, we can now determine the actual percentages by following a horizontal line to the x-axis.

If for some reason you need to make a pie chart, label each pie slice with its respective percentage so viewers do not have to infer them from the angles or area:

In general, when displaying quantities, position and length are preferred over angles and/or area. Brightness and color are even harder to quantify than angles. But, as we will see later, they are sometimes useful when more than two dimensions must be displayed at once.

11.2 Know when to include 0

When using barplots, it is misinformative not to start the bars at 0. This is because, by using a barplot, we are implying the length is proportional to the quantities being displayed. By avoiding 0, relatively small differences can be made to look much bigger than they actually are. This approach is often used by politicians or media organizations trying to exaggerate a difference. Below is an illustrative example used by Peter Aldhous in this lecture: http://paldhous.github.io/ucb/2016/dataviz/week2.html .

(Source: Fox News, via Media Matters 34 .)

From the plot above, it appears that apprehensions have almost tripled when, in fact, they have only increased by about 16%. Starting the graph at 0 illustrates this clearly:

Here is another example, described in detail in a Flowing Data blog post:

This plot makes a 13% increase look like a fivefold change. Here is the appropriate plot:
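The arithmetic behind this distortion can be sketched directly. The values below are hypothetical, chosen only to reproduce the 13%-looks-fivefold effect described above:

```python
# Hypothetical values: a 13% increase, as in the example above.
before, after = 100.0, 113.0
true_ratio = after / before                       # 1.13

# If the y-axis starts at 96.75 instead of 0, the visible bar heights are
# (value - baseline), and the apparent ratio of bar heights balloons.
baseline = 96.75
apparent_ratio = (after - baseline) / (before - baseline)
```

With the axis truncated near the smaller value, the bars are in a 5:1 ratio even though the underlying values differ by only 13%.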

Finally, here is an extreme example that makes a very small difference of under 2% look like a 10-100 fold change:

(Source: Venezolana de Televisión via Pakistan Today 36 and Diego Mariano.)

Here is the appropriate plot:

When using position rather than length, it is then not necessary to include 0. This is particularly the case when we want to compare differences between groups relative to the within-group variability. Here is an illustrative example showing country average life expectancy stratified across continents in 2012:

Note that in the plot on the left, which includes 0, the space between 0 and 43 adds no information and makes it harder to compare the between and within group variability.

11.3 Do not distort quantities

During President Barack Obama’s 2011 State of the Union Address, the following chart was used to compare the US GDP to the GDP of four competing nations:

Judging by the area of the circles, the US appears to have an economy over five times larger than China’s and over 30 times larger than France’s. However, if we look at the actual numbers, we see that this is not the case. The actual ratios are 2.6 and 5.8 times bigger than China and France, respectively. The reason for this distortion is that the radius, rather than the area, was made to be proportional to the quantity, which implies that the proportion between the areas is squared: 2.6 turns into 6.5 and 5.8 turns into 34.1. Here is a comparison of the circles we get if we make the value proportional to the radius and to the area:

Not surprisingly, ggplot2 defaults to using area rather than radius. Of course, in this case, we really should not be using area at all since we can use position and length:
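The size of the distortion follows from the geometry: if radii are proportional to the values, the areas, and hence the visual impression, scale with the square of the true ratio. A quick check, assuming unrounded ratios of about 2.55 and 5.84 (consistent with the rounded 2.6 and 5.8 quoted above):

```python
# If a circle's radius (rather than its area) is made proportional to the
# value, the visual impression scales with the square of the true ratio.
def apparent_area_ratio(value_ratio):
    return value_ratio ** 2

# Approximate unrounded GDP ratios (assumed values consistent with the text).
us_china, us_france = 2.55, 5.84

china_apparent = round(apparent_area_ratio(us_china), 1)
france_apparent = round(apparent_area_ratio(us_france), 1)
```

The squared ratios land on the 6.5 and 34.1 figures given in the text, which is why the US circle looks so disproportionately large.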

11.4 Order categories by a meaningful value

When one of the axes is used to show categories, as is done in barplots, the default ggplot2 behavior is to order the categories alphabetically when they are defined by character strings. If they are defined by factors, they are ordered by the factor levels. We rarely want to use alphabetical order. Instead, we should order by a meaningful quantity. In all the cases above, the barplots were ordered by the values being displayed. The exception was the graph showing barplots comparing browsers. In this case, we kept the order the same across the barplots to ease the comparison. Specifically, instead of ordering the browsers separately in the two years, we ordered both years by the average of the 2000 and 2015 values.

We previously learned how to use the reorder function, which helps us achieve this goal. To appreciate how the right order can help convey a message, suppose we want to create a plot to compare the murder rate across states. We are particularly interested in the most dangerous and safest states. Note the difference when we order alphabetically (the default) versus when we order by the actual rate:

We can make the second plot by using reorder to re-level the states by murder rate before plotting.
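The same principle applies outside ggplot2: sort the categories by the displayed value before drawing. A pure-Python sketch with hypothetical murder rates (the book's actual example uses the dslabs murders data):

```python
# Hypothetical state murder rates (per 100,000); the point is the ordering.
rates = {"Vermont": 0.3, "Texas": 3.2, "Louisiana": 7.7,
         "Hawaii": 0.5, "Missouri": 5.4}

alphabetical = sorted(rates)               # the default, rarely what we want
by_value = sorted(rates, key=rates.get)    # safest to most dangerous
```

Plotting the categories in `by_value` order puts the extremes at the ends of the axis, which is exactly what makes the safest and most dangerous states easy to spot.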

The reorder function lets us reorder groups as well. Earlier we saw an example related to income distributions across regions. Here are the two versions plotted against each other:

The first orders the regions alphabetically, while the second orders them by the group’s median.

11.5 Show the data

We have focused on displaying single quantities across categories. We now shift our attention to displaying data, with a focus on comparing groups.

To motivate our first principle, “show the data”, we go back to our artificial example of describing heights to ET, an extraterrestrial. This time let’s assume ET is interested in the difference in heights between males and females. A commonly seen plot used for comparisons between groups, popularized by software such as Microsoft Excel, is the dynamite plot, which shows the average and standard errors (standard errors are defined in a later chapter, but do not confuse them with the standard deviation of the data). The plot looks like this:

The average of each group is represented by the top of each bar and the antennae extend out from the average to the average plus two standard errors. If all ET receives is this plot, he will have little information on what to expect if he meets a group of human males and females. The bars go to 0: does this mean there are tiny humans measuring less than one foot? Are all males taller than the tallest females? Is there a range of heights? ET can’t answer these questions since we have provided almost no information on the height distribution.

This brings us to our first principle: show the data. This simple ggplot2 code already generates a more informative plot than the barplot by simply showing all the data points:

For example, this plot gives us an idea of the range of the data. However, this plot has limitations as well, since we can’t really see all the 238 and 812 points plotted for females and males, respectively, and many points are plotted on top of each other. As we have previously described, visualizing the distribution is much more informative. But before doing this, we point out two ways we can improve a plot showing all the points.

The first is to add jitter , which adds a small random shift to each point. In this case, adding horizontal jitter does not alter the interpretation, since the point heights do not change, but we minimize the number of points that fall on top of each other and, therefore, get a better visual sense of how the data is distributed. A second improvement comes from using alpha blending : making the points somewhat transparent. The more points fall on top of each other, the darker the plot, which also helps us get a sense of how the points are distributed. Here is the same plot with jitter and alpha blending:

Now we start getting a sense that, on average, males are taller than females. We also note dark horizontal bands of points, demonstrating that many report values that are rounded to the nearest integer.
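Jitter only perturbs the axis that carries no information, so the data are not distorted. A minimal sketch of horizontal jitter with hypothetical heights (in ggplot2 this is geom_jitter; alpha blending would be applied at plotting time):

```python
import random

random.seed(0)

# Hypothetical heights (inches) for two groups plotted at x = 1 and x = 2.
heights = {1: [64, 65, 65, 66], 2: [69, 70, 70, 72]}

# Horizontal jitter: shift each point's x by a small random amount so points
# sharing the same height no longer sit exactly on top of each other.
jittered = [(x + random.uniform(-0.1, 0.1), h)
            for x, group in heights.items() for h in group]
```

The y-values are untouched, so the interpretation of the heights is unchanged; only the overplotting is reduced.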

11.6 Ease comparisons

11.6.1 Use common axes

Since there are so many points, it is more effective to show distributions rather than individual points. We therefore show histograms for each group:

However, from this plot it is not immediately obvious that males are, on average, taller than females. We have to look carefully to notice that the x-axis has a higher range of values in the male histogram. An important principle here is to keep the axes the same when comparing data across two plots. Below we see how the comparison becomes easier:

11.6.2 Align plots vertically to see horizontal changes and horizontally to see vertical changes

In these histograms, the visual cue related to decreases or increases in height are shifts to the left or right, respectively: horizontal changes. Aligning the plots vertically helps us see this change when the axes are fixed:

This plot makes it much easier to notice that men are, on average, taller.

If we want the more compact summary provided by boxplots, we align them horizontally since, by default, boxplots move up and down with changes in height. Following our show-the-data principle, we then overlay all the data points:

Now contrast and compare these three plots, based on exactly the same data:

Notice how much more we learn from the two plots on the right. Barplots are useful for showing one number, but not very useful when we want to describe distributions.

11.6.3 Consider transformations

We have motivated the use of the log transformation in cases where the changes are multiplicative. Population size was an example in which we found a log transformation to yield a more informative plot.

The combination of an incorrectly chosen barplot and a failure to use a log transformation when one is merited can be particularly distorting. As an example, consider this barplot showing the average population sizes for each continent in 2015:

From this plot, one would conclude that countries in Asia are much more populous than in other continents. Following the show the data principle, we quickly notice that this is due to two very large countries, which we assume are India and China:

Using a log transformation here provides a much more informative plot. We compare the original barplot to a boxplot using the log scale transformation for the y-axis:

With the new plot, we realize that countries in Africa actually have a larger median population size than those in Asia.

Other transformations you should consider are the logistic transformation (logit), useful to better see fold changes in odds, and the square root transformation (sqrt), useful for count data.
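These transformations are simple to sketch. With multiplicative data, equal fold changes become equal steps on the log scale, and the logit does the same for odds (the population values below are hypothetical):

```python
import math

# Populations spanning several orders of magnitude (hypothetical values):
populations = [1_000_000, 10_000_000, 100_000_000, 1_000_000_000]

# On the raw scale the largest value dominates; on the log10 scale the
# equal tenfold changes become equal unit steps.
logs = [math.log10(p) for p in populations]

# Logit: log odds, symmetric around p = 0.5, so fold changes in odds
# toward 0 and toward 1 get equal visual weight.
def logit(p):
    return math.log(p / (1 - p))
```

This symmetry is why logit axes are preferred when proportions near 0 or 1 matter as much as those near 0.5.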

11.6.4 Visual cues to be compared should be adjacent

For each continent, let’s compare income in 1970 versus 2010. When comparing income data across regions between 1970 and 2010, we made a figure similar to the one below, but this time we investigate continents rather than regions.

The default in ggplot2 is to order labels alphabetically so the labels with 1970 come before the labels with 2010, making the comparisons challenging because a continent’s distribution in 1970 is visually far from its distribution in 2010. It is much easier to make the comparison between 1970 and 2010 for each continent when the boxplots for that continent are next to each other:

11.6.5 Use color

The comparison becomes even easier to make if we use color to denote the two things we want to compare:

11.7 Think of the color blind

About 10% of the population is color blind. Unfortunately, the default colors used in ggplot2 are not optimal for this group. However, ggplot2 does make it easy to change the color palette used in the plots. An example of how we can use a color blind friendly palette is described here: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette :

There are several resources that can help you select colors, for example this one: http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/ .

11.8 Plots for two variables

In general, you should use scatterplots to visualize the relationship between two variables. In every single instance in which we have examined the relationship between two variables, including total murders versus population size, life expectancy versus fertility rates, and infant mortality versus income, we have used scatterplots. This is the plot we generally recommend. However, there are some exceptions and we describe two alternative plots here: the slope chart and the Bland-Altman plot .

11.8.1 Slope charts

One exception where another type of plot may be more informative is when you are comparing variables of the same type, but at different time points and for a relatively small number of comparisons. For example, comparing life expectancy between 2010 and 2015. In this case, we might recommend a slope chart .

There is no geometry for slope charts in ggplot2 , but we can construct one using geom_line . We need to do some tinkering to add labels. Below is an example comparing 2010 to 2015 for large western countries:
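A hedged sketch of such a slope chart (a simplified version; the region names and the population cutoff are assumptions used to pick "large western countries"):

```r
library(dslabs)
library(dplyr)
library(ggplot2)

data(gapminder)

west <- c("Western Europe", "Northern Europe", "Southern Europe",
          "Northern America", "Australia and New Zealand")

dat <- gapminder %>%
  filter(year %in% c(2010, 2015), region %in% west,
         !is.na(life_expectancy), population > 10^7)

# geom_line draws one segment per country; labels go next to the 2015 points
ggplot(dat, aes(factor(year), life_expectancy, group = country)) +
  geom_line() +
  geom_text(data = filter(dat, year == 2015),
            aes(label = country), hjust = 0, nudge_x = 0.05, size = 3) +
  xlab("") + ylab("Life expectancy")
```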

An advantage of the slope chart is that it permits us to quickly get an idea of changes based on the slope of the lines. Although we are using angle as the visual cue, we also have position to determine the exact values. Comparing the improvements is a bit harder with a scatterplot:

In the scatterplot, we have followed the principle use common axes since we are comparing these before and after. However, if we have many points, slope charts stop being useful as it becomes hard to see all the lines.

11.8.2 Bland-Altman plot

Since we are primarily interested in the difference, it makes sense to dedicate one of our axes to it. The Bland-Altman plot, also known as the Tukey mean-difference plot and the MA-plot, shows the difference versus the average:
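A sketch of a Bland-Altman plot for the same life expectancy comparison, pivoting the two years into columns (the reshaping step is an assumption about how one might construct it, not the book's code):

```r
library(dslabs)
library(dplyr)
library(tidyr)
library(ggplot2)

data(gapminder)

west <- c("Western Europe", "Northern Europe", "Southern Europe",
          "Northern America", "Australia and New Zealand")

gapminder %>%
  filter(year %in% c(2010, 2015), region %in% west,
         !is.na(life_expectancy), population > 10^7) %>%
  mutate(year = paste0("y", year)) %>%
  select(country, year, life_expectancy) %>%
  pivot_wider(names_from = year, values_from = life_expectancy) %>%
  ggplot(aes((y2010 + y2015) / 2, y2015 - y2010, label = country)) +
  geom_point() +
  geom_text(nudge_y = 0.05, size = 3) +
  xlab("Average of 2010 and 2015") +
  ylab("Difference between 2015 and 2010")
```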

Here, by simply looking at the y-axis, we quickly see which countries have shown the most improvement. We also get an idea of the overall value from the x-axis.

11.9 Encoding a third variable

An earlier scatterplot showed the relationship between infant survival and average income. Below is a version of this plot that encodes three variables: OPEC membership, region, and population.

We encode categorical variables with color and shape. These shapes can be controlled with the shape argument. Below are the shapes available for use in R. For the last five, the color goes inside.

For continuous variables, we can use color, intensity, or size. We now show an example of how we do this with a case study.

When selecting colors to quantify a numeric variable, we choose between two options: sequential and diverging. Sequential colors are suited for data that goes from high to low. High values are clearly distinguished from low values. Here are some examples offered by the package RColorBrewer :

Diverging colors are used to represent values that diverge from a center. We put equal emphasis on both ends of the data range: higher than the center and lower than the center. An example of when we would use a divergent pattern would be if we were to show height in standard deviations away from the average. Here are some examples of divergent patterns:
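The sequential and diverging palettes can be browsed and extracted directly from RColorBrewer:

```r
library(RColorBrewer)

display.brewer.all(type = "seq")  # plots the sequential palettes
display.brewer.all(type = "div")  # plots the diverging palettes

brewer.pal(9, "Reds")   # nine hex codes from a sequential palette
brewer.pal(9, "RdBu")   # nine hex codes from a diverging palette
```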

11.10 Avoid pseudo-three-dimensional plots

The figure below, taken from the scientific literature 38 , shows three variables: dose, drug type and survival. Although your screen/book page is flat and two-dimensional, the plot tries to imitate three dimensions and assigns a dimension to each variable.

Humans are not good at seeing in three dimensions (which explains why it is hard to parallel park) and our limitation is even worse with regard to pseudo-three-dimensions. To see this, try to determine the values of the survival variable in the plot above. Can you tell when the purple ribbon intersects the red one? This is an example in which we can easily use color to represent the categorical variable instead of using a pseudo-3D:

Notice how much easier it is to determine the survival values.

Pseudo-3D is sometimes used completely gratuitously: plots are made to look 3D even when the 3rd dimension does not represent a quantity. This only adds confusion and makes it harder to relay your message. Here are two examples:

11.11 Avoid too many significant digits

By default, statistical software like R returns many significant digits. The default behavior in R is to show 7 significant digits. That many digits often add no information and the added visual clutter can make it hard for the viewer to understand the message. As an example, here are the per 10,000 disease rates, computed from totals and population in R, for California across the five decades:

state year Measles Pertussis Polio
California 1940 37.8826320 18.3397861 0.8266512
California 1950 13.9124205 4.7467350 1.9742639
California 1960 14.1386471 NA 0.2640419
California 1970 0.9767889 NA NA
California 1980 0.3743467 0.0515466 NA

We are reporting precision up to 0.00001 cases per 10,000, a very small value in the context of the changes that are occurring across the dates. In this case, two significant figures is more than enough and clearly makes the point that rates are decreasing:

state year Measles Pertussis Polio
California 1940 37.9 18.3 0.8
California 1950 13.9 4.7 2.0
California 1960 14.1 NA 0.3
California 1970 1.0 NA NA
California 1980 0.4 0.1 NA

Useful ways to change the number of significant digits or to round numbers are signif and round . You can define the number of significant digits globally by setting options like this: options(digits = 3) .
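For example, starting from the first measles rate in the table above:

```r
x <- 37.8826320

signif(x, 2)   # two significant digits: 38
round(x, 1)    # one digit after the decimal point: 37.9

options(digits = 3)  # global default for printed significant digits
x                    # now prints as 37.9
```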

Another principle related to displaying tables is to place values being compared on columns rather than rows. Note that our table above is easier to read than this one:

state disease 1940 1950 1960 1970 1980
California Measles 37.9 13.9 14.1 1.0 0.4
California Pertussis 18.3 4.7 NA NA 0.1
California Polio 0.8 2.0 0.3 NA NA

11.12 Know your audience

Graphs can be used 1) for our own exploratory data analysis, 2) to convey a message to experts, or 3) to help tell a story to a general audience. Make sure that the intended audience understands each element of the plot.

As a simple example, consider that for your own exploration it may be more useful to log-transform data and then plot it. However, for a general audience that is unfamiliar with converting logged values back to the original measurements, using a log-scale for the axis instead of log-transformed values will be much easier to digest.

11.13 Exercises

For these exercises, we will be using the vaccines data in the dslabs package:

1. Pie charts are appropriate:

  • When we want to display percentages.
  • When ggplot2 is not available.
  • When I am in a bakery.
  • Never. Barplots and tables are always better.

2. What is the problem with the plot below:

  • The values are wrong. The final vote was 306 to 232.
  • The axis does not start at 0. Judging by the length, it appears Trump received 3 times as many votes when, in fact, it was about 30% more.
  • The colors should be the same.
  • Percentages should be shown as a pie chart.

3. Take a look at the following two plots. They show the same information: 1928 rates of measles across the 50 states.

  • They provide the same information, so they are both equally good.
  • The plot on the right is better because it orders the states alphabetically.
  • The plot on the right is better because alphabetical order has nothing to do with the disease and by ordering according to actual rate, we quickly see the states with most and least rates.
  • Both plots should be a pie chart.

4. To make the plot on the left, we have to reorder the levels of the states’ variables.

Note what happens when we make a barplot:

Define these objects:

Redefine the state object so that the levels are re-ordered. Print the new object state and its levels so you can see that the vector is not re-ordered by the levels.

5. Now, with one line of code, define the dat table as done above, but use mutate to create a rate variable and reorder the state variable so that the levels are re-ordered by this variable. Then make a barplot using the code above, but with this new dat .

6. Say we are interested in comparing gun homicide rates across regions of the US. We see this plot:

and decide to move to a state in the western region. What is the main problem with this interpretation?

  • The categories are ordered alphabetically.
  • The graph does not show standard errors.
  • It does not show all the data. We do not see the variability within a region and it’s possible that the safest states are not in the West.
  • The Northeast has the lowest average.

7. Make a boxplot of the murder rates defined as

by region, showing all the points and ordering the regions by their median rate.

8. The plots below show three continuous variables.

The line \(x=2\) appears to separate the points. But it is actually not the case, which we can see by plotting the data in a couple of two-dimensional plots.

Why is this happening?

  • Humans are not good at reading pseudo-3D plots.
  • There must be an error in the code.
  • The colors confuse us.
  • Scatterplots should not be used to compare two variables when we have access to 3.

11.14 Case study: vaccines and infectious diseases

Vaccines have helped save millions of lives. In the 19th century, before herd immunization was achieved through vaccination programs, deaths from infectious diseases, such as smallpox and polio, were common. However, today vaccination programs have become somewhat controversial despite all the scientific evidence for their importance.

The controversy started with a paper 39 published in 1998 and led by Andrew Wakefield claiming there was a link between the administration of the measles, mumps, and rubella (MMR) vaccine and the appearance of autism and bowel disease. Despite much scientific evidence contradicting this finding, sensationalist media reports and fear-mongering from conspiracy theorists led parts of the public into believing that vaccines were harmful. As a result, many parents ceased to vaccinate their children. This dangerous practice can be potentially disastrous given that the Centers for Disease Control (CDC) estimates that vaccinations will prevent more than 21 million hospitalizations and 732,000 deaths among children born in the last 20 years (see Benefits from Immunization during the Vaccines for Children Program Era — United States, 1994-2013, MMWR 40 ). The 1998 paper has since been retracted and Andrew Wakefield was eventually “struck off the UK medical register, with a statement identifying deliberate falsification in the research published in The Lancet, and was thereby barred from practicing medicine in the UK.” (source: Wikipedia 41 ). Yet misconceptions persist, in part due to self-proclaimed activists who continue to disseminate misinformation about vaccines.

Effective communication of data is a strong antidote to misinformation and fear-mongering. Earlier we used an example provided by a Wall Street Journal article 42 showing data related to the impact of vaccines on battling infectious diseases. Here we reconstruct that example.

The data used for these plots were collected, organized, and distributed by the Tycho Project 43 . They include weekly reported counts for seven diseases from 1928 to 2011, from all fifty states. We include the yearly totals in the dslabs package:

We create a temporary object dat that stores only the measles data, includes a per 100,000 rate, orders states by average value of disease and removes Alaska and Hawaii since they only became states in the late 1950s. Note that there is a weeks_reporting column that tells us for how many weeks of the year data was reported. We have to adjust for that value when computing the rate.
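A hedged sketch of how dat could be constructed (column names disease, state, year, weeks_reporting, count, and population are assumed from the dslabs documentation; this is not necessarily the book's exact code):

```r
library(dslabs)
library(dplyr)

data(us_contagious_diseases)

the_disease <- "Measles"
dat <- us_contagious_diseases %>%
  filter(!state %in% c("Hawaii", "Alaska") & disease == the_disease) %>%
  # adjust for partial reporting: scale weekly counts up to a full 52-week year
  mutate(rate = count / population * 100000 * 52 / weeks_reporting) %>%
  # reorder orders the state levels by average rate (its default summary function)
  mutate(state = reorder(state, rate))
```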

We can now easily plot disease rates per year. Here are the measles data from California:

We add a vertical line at 1963 since this is when the vaccine was introduced [Control, Centers for Disease; Prevention (2014). CDC health information for international travel 2014 (the yellow book). p. 250. ISBN 9780199948505].
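Continuing with the dat object described above, a minimal version of the California plot might look like this:

```r
library(dplyr)
library(ggplot2)

dat %>%
  filter(state == "California" & !is.na(rate)) %>%
  ggplot(aes(year, rate)) +
  geom_line() +
  ylab("Cases per 100,000") +
  geom_vline(xintercept = 1963, col = "blue")  # vaccine introduced in 1963
```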

Can we now show data for all states in one plot? We have three variables to show: year, state, and rate. In the WSJ figure, they use the x-axis for year, the y-axis for state, and color hue to represent rates. However, the color scale they use, which goes from yellow to blue to green to orange to red, can be improved.

In our example, we want to use a sequential palette since there is no meaningful center, just low and high rates.

We use the geometry geom_tile to tile the region with colors representing disease rates. We use a square root transformation to avoid having the really high counts dominate the plot. Notice that missing values are shown in grey. Note that once a disease was pretty much eradicated, some states stopped reporting cases altogether. This is why we see so much grey after 1980.
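A sketch of this tile plot, again using the dat object described above (the palette choice and theme are illustrative assumptions):

```r
library(dplyr)
library(ggplot2)
library(RColorBrewer)

dat %>%
  ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand = c(0, 0)) +
  # sequential palette; the sqrt transformation keeps high counts from dominating
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept = 1963, col = "blue") +
  ggtitle("Measles") +
  ylab("") + xlab("")
```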

This plot makes a very striking argument for the contribution of vaccines. However, one limitation of this plot is that it uses color to represent quantity, which we earlier explained makes it harder to know exactly how high values are going. Position and lengths are better cues. If we are willing to lose state information, we can make a version of the plot that shows the values with position. We can also show the average for the US, which we compute like this:

Now to make the plot we simply use the geom_line geometry:
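A sketch of both steps, the US average and the line plot, using the dat object described earlier (a simplified version that does not adjust the average for weeks_reporting):

```r
library(dslabs)
library(dplyr)
library(ggplot2)

data(us_contagious_diseases)

# US rate per year: total cases divided by total population
avg <- us_contagious_diseases %>%
  filter(disease == "Measles") %>%
  group_by(year) %>%
  summarize(us_rate = sum(count, na.rm = TRUE) /
              sum(population, na.rm = TRUE) * 100000)

dat %>%
  filter(!is.na(rate)) %>%
  ggplot() +
  geom_line(aes(year, rate, group = state), color = "grey50", alpha = 0.4) +
  geom_line(data = avg, aes(year, us_rate), linewidth = 1) +
  scale_y_continuous(trans = "sqrt") +
  geom_vline(xintercept = 1963, col = "blue") +
  ggtitle("Measles cases per 100,000 by state, with US average")
```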

In theory, we could use color to represent the categorical value state, but it is hard to pick 50 distinct colors.

11.15 Exercises

Reproduce the image plot we previously made but for smallpox. For this plot, do not include years in which cases were not reported in 10 or more weeks.

Now reproduce the time series plot we previously made, but this time following the instructions of the previous question for smallpox.

For the state of California, make a time series plot showing rates for all diseases. Include only years with 10 or more weeks reporting. Use a different color for each disease.

Now do the same for the rates for the US. Hint: compute the US rate by using summarize: the total divided by total population.

http://kbroman.org/
https://www.biostat.wisc.edu/~kbroman/presentations/graphs2017.pdf
https://github.com/kbroman/Talk_Graphs
http://paldhous.github.io/ucb/2016/dataviz/index.html
http://mediamatters.org/blog/2013/04/05/fox-news-newest-dishonest-chart-immigration-enf/193507
http://flowingdata.com/2012/08/06/fox-news-continues-charting-excellence/
https://www.pakistantoday.com.pk/2018/05/18/whats-at-stake-in-venezuelan-presidential-vote
https://www.youtube.com/watch?v=kl2g40GoRxg
https://projecteuclid.org/download/pdf_1/euclid.ss/1177010488
http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(97)11096-0/abstract
https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6316a4.htm
https://en.wikipedia.org/wiki/Andrew_Wakefield
http://graphics.wsj.com/infectious-diseases-and-vaccines/
http://www.tycho.pitt.edu/


Data visualization is the representation of data through use of common graphics, such as charts, plots, infographics and even animations. These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand.

Data visualization can be utilized for a variety of purposes, and it’s important to note that it is not reserved only for use by data teams. Management also leverages it to convey organizational structure and hierarchy, while data analysts and data scientists use it to discover and explain patterns and trends. Harvard Business Review (link resides outside ibm.com) categorizes data visualization into four key purposes: idea generation, idea illustration, visual discovery, and everyday data viz. We’ll delve deeper into these below:

Idea generation

Data visualization is commonly used to spur idea generation across teams. Visualizations are frequently leveraged during brainstorming or Design Thinking sessions at the start of a project by supporting the collection of different perspectives and highlighting the common concerns of the collective. While these visualizations are usually unpolished and unrefined, they help set the foundation within the project to ensure that the team is aligned on the problem that they’re looking to address for key stakeholders.

Idea illustration

Data visualization for idea illustration assists in conveying an idea, such as a tactic or process. It is commonly used in learning settings, such as tutorials, certification courses, centers of excellence, but it can also be used to represent organization structures or processes, facilitating communication between the right individuals for specific tasks. Project managers frequently use Gantt charts and waterfall charts to illustrate  workflows .  Data modeling  also uses abstraction to represent and better understand data flow within an enterprise’s information system, making it easier for developers, business analysts, data architects, and others to understand the relationships in a database or data warehouse.

Visual discovery

Visual discovery and everyday data viz are more closely aligned with data teams. While visual discovery helps data analysts, data scientists, and other data professionals identify patterns and trends within a dataset, everyday data viz supports the subsequent storytelling after a new insight has been found.

Everyday data viz

Data visualization is a critical step in the data science process, helping teams and individuals convey data more effectively to colleagues and decision makers. Teams that manage reporting systems typically leverage defined template views to monitor performance. However, data visualization isn’t limited to performance dashboards. For example, while text mining an analyst may use a word cloud to capture key concepts, trends, and hidden relationships within this unstructured data. Alternatively, they may utilize a graph structure to illustrate relationships between entities in a knowledge graph. There are a number of ways to represent different types of data, and it’s important to remember that it is a skillset that should extend beyond your core analytics team.


The earliest forms of data visualization can be traced back to the Egyptians before the 17th century, largely used to assist in navigation. As time progressed, people leveraged data visualizations for broader applications, such as in economic, social, and health disciplines. Perhaps most notably, Edward Tufte published The Visual Display of Quantitative Information (link resides outside ibm.com), which illustrated that individuals could utilize data visualization to present data in a more effective manner. His book continues to stand the test of time, especially as companies turn to dashboards to report their performance metrics in real-time. Dashboards are effective data visualization tools for tracking and visualizing data from multiple data sources, providing visibility into the effects of specific behaviors by a team or an adjacent one on performance. Dashboards include common visualization techniques, such as:

  • Tables: This consists of rows and columns used to compare variables. Tables can show a great deal of information in a structured way, but they can also overwhelm users that are simply looking for high-level trends.
  • Pie charts and stacked bar charts:  These graphs are divided into sections that represent parts of a whole. They provide a simple way to organize data and compare the size of each component to one another.
  • Line charts and area charts:  These visuals show change in one or more quantities by plotting a series of data points over time and are frequently used within predictive analytics. Line graphs utilize lines to demonstrate these changes while area charts connect data points with line segments, stacking variables on top of one another and using color to distinguish between variables.
  • Histograms: This graph plots a distribution of numbers using a bar chart (with no spaces between the bars), representing the quantity of data that falls within a particular range. This visual makes it easy for an end user to identify outliers within a given dataset.
  • Scatter plots: These visuals are beneficial in revealing the relationship between two variables, and they are commonly used within regression data analysis. However, these can sometimes be confused with bubble charts, which are used to visualize three variables via the x-axis, the y-axis, and the size of the bubble.
  • Heat maps:  These graphical displays are helpful in visualizing behavioral data by location. This can be a location on a map, or even a webpage.
  • Tree maps: These display hierarchical data as a set of nested shapes, typically rectangles. Treemaps are great for comparing the proportions between categories via their area size.

Access to data visualization tools has never been easier. Open source libraries, such as D3.js, provide a way for analysts to present data in an interactive way, allowing them to engage a broader audience with new data. Some of the most popular open source visualization libraries include:

  • D3.js: It is a front-end JavaScript library for producing dynamic, interactive data visualizations in web browsers.  D3.js  (link resides outside ibm.com) uses HTML, CSS, and SVG to create visual representations of data that can be viewed on any browser. It also provides features for interactions and animations.
  • ECharts:  A powerful charting and visualization library that offers an easy way to add intuitive, interactive, and highly customizable charts to products, research papers, presentations, etc.  Echarts  (link resides outside ibm.com) is based in JavaScript and ZRender, a lightweight canvas library.
  • Vega:   Vega  (link resides outside ibm.com) defines itself as “visualization grammar,” providing support to customize visualizations across large datasets which are accessible from the web.
  • deck.gl: It is part of Uber's open source visualization framework suite.  deck.gl  (link resides outside ibm.com) is a framework, which is used for  exploratory data analysis  on big data. It helps build high-performance GPU-powered visualization on the web.

With so many data visualization tools readily available, there has also been a rise in ineffective information visualization. Visual communication should be simple and deliberate to ensure that your data visualization helps your target audience arrive at your intended insight or conclusion. The following best practices can help ensure your data visualization is useful and clear:

Set the context: It’s important to provide general background information to ground the audience around why this particular data point is important. For example, if e-mail open rates were underperforming, we may want to illustrate how a company’s open rate compares to the overall industry, demonstrating that the company has a problem within this marketing channel. To drive an action, the audience needs to understand how current performance compares to something tangible, like a goal, benchmark, or other key performance indicators (KPIs).

Know your audience(s): Think about who your visualization is designed for and then make sure your data visualization fits their needs. What is that person trying to accomplish? What kind of questions do they care about? Does your visualization address their concerns? You’ll want the data that you provide to motivate people to act within their scope of their role. If you’re unsure if the visualization is clear, present it to one or two people within your target audience to get feedback, allowing you to make additional edits prior to a large presentation.

Choose an effective visual:  Specific visuals are designed for specific types of datasets. For instance, scatter plots display the relationship between two variables well, while line graphs display time series data well. Ensure that the visual actually assists the audience in understanding your main takeaway. Misalignment of charts and data can result in the opposite, confusing your audience further versus providing clarity.

Keep it simple:  Data visualization tools can make it easy to add all sorts of information to your visual. However, just because you can, it doesn’t mean that you should! In data visualization, you want to be very deliberate about the additional information that you add to focus user attention. For example, do you need data labels on every bar in your bar chart? Perhaps you only need one or two to help illustrate your point. Do you need a variety of colors to communicate your idea? Are you using colors that are accessible to a wide range of audiences (e.g. accounting for color blind audiences)? Design your data visualization for maximum impact by eliminating information that may distract your target audience.


2.3: Other Graphical Representations of Data


  • Kathryn Kozak
  • Coconino Community College


There are many other types of graphs. Some of the more common ones are the frequency polygon, the dot plot, the stem plot, scatter plot, and a time-series plot. There are also many different graphs that have emerged lately for qualitative data. Many are found in publications and websites. The following is a description of the stem plot, the scatter plot, and the time-series plot.

Stem plots are a quick and easy way to look at small samples of numerical data. You can look for any patterns or any strange data values. It is easy to compare two samples using stem plots.

The first step is to divide each number into 2 parts, the stem (such as the leftmost digit) and the leaf (such as the rightmost digit). There are no set rules; you just have to look at the data and see what makes sense.

Example \(\PageIndex{1}\) stem plot for grade distribution

The following are the percentage grades of 25 students from a statistics course. Draw a stem plot of the data.

62 87 81 69 87 62 45 95 76 76
62 71 65 67 72 80 40 77 87 58
84 73 93 64 89          
Table \(\PageIndex{1}\): Data of Test Grades

Divide each number so that the tens digit is the stem and the ones digit is the leaf. 62 becomes 6|2.

Make a vertical chart with the stems on the left of a vertical bar. Be sure to fill in any missing stems. In other words, the stems should have equal spacing (for example, count by ones or count by tens). Graph 2.3.1 shows the stems for this example.

[Graph 2.3.1: the stems for the test grade data]

Now go through the list of data and add the leaves. Put each leaf next to its corresponding stem. Don’t worry about order yet; just get all the leaves down.

When the data value 62 is placed on the plot it looks like the plot in Graph 2.3.2 .

[Graph 2.3.2: stem plot with the data value 62 added]

When the data value 87 is placed on the plot it looks like the plot in Graph 2.3.3 .

[Graph 2.3.3: stem plot with the data value 87 added]

Fill in the rest of the leaves to obtain the plot in Graph 2.3.4.

[Graph 2.3.4: stem plot with all leaves filled in]

Now you have to add labels and make the graph look pretty. You need to add a title and sort the leaves into increasing order. You also need to tell people what the stems and leaves mean by inserting a legend. Be careful to line the leaves up in columns; you need to be able to compare the lengths of the rows when you interpret the graph. The final stem plot for the test grade data is in Graph 2.3.5.

[Graph 2.3.5: final stem plot of the test grade data]

Now you can interpret the stem-and-leaf display. The data is bimodal and somewhat symmetric. There are no gaps in the data. The center of the distribution is around 70.

You can create a stem and leaf plot in R. The command is:

stem(variable) – creates a stem and leaf plot. If the result does not show all of the stems, add the scale argument and adjust its value until all of the stems appear: stem(variable, scale = a number).

For Example \(\PageIndex{1}\), the command would be

grades <- c(62, 87, 81, 69, 87, 62, 45, 95, 76, 76, 62, 71, 65, 67, 72, 80, 40, 77, 87, 58, 84, 73, 93, 64, 89)
stem(grades, scale = 2)

The decimal point is 1 digit(s) to the right of the |

[R output: stem plot of the grades data]

Now just put a title on the stem plot.

Scatter Plot

Sometimes you have two different variables and you want to see if they are related in any way. A scatter plot helps you to see what the relationship would look like. A scatter plot is just a plotting of the ordered pairs.

Example \(\PageIndex{2}\) scatter plot

Is there any relationship between elevation and high temperature on a given day? The following data are the high temperatures at various cities on a single day and the elevation of the city.

Elevation (in feet): 7000 4000 6000 3000 7000 4500 5000
Temperature (°F): 50 60 48 70 55 55 60
Table \(\PageIndex{2}\): Data of Temperature versus Elevation

Preliminary: State the random variables

Let x = altitude

y = high temperature

Now plot the x values on the horizontal axis and the y values on the vertical axis. Then set up a scale that fits the data on each axis. Once that is done, plot the x and y values as ordered pairs. In R, the command is:

independent variable <- c(type in data with commas in between values)
dependent variable <- c(type in data with commas in between values)
plot(independent variable, dependent variable, main="type in a title you want", xlab="type in a label for the horizontal axis", ylab="type in a label for the vertical axis", ylim=c(0, number above maximum y value))

For this example, that would be:
elevation <- c(7000, 4000, 6000, 3000, 7000, 4500, 5000)
temperature <- c(50, 60, 48, 70, 55, 55, 60)
plot(elevation, temperature, main="Temperature versus Elevation", xlab="Elevation (in feet)", ylab="Temperature (in degrees F)", ylim=c(0, 80))

[Scatter plot of temperature versus elevation]

Looking at the graph, it appears that there is a linear relationship between temperature and elevation. It also appears to be a negative relationship: as elevation increases, the temperature decreases.
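The direction of this relationship can also be quantified. As an illustrative aside (the chapter’s examples use R; this is a Python sketch), the Pearson correlation coefficient for the data above comes out negative, matching the downward trend in the scatter plot:

```python
import numpy as np

# The elevation/temperature data from the example above
elevation = np.array([7000, 4000, 6000, 3000, 7000, 4500, 5000])
temperature = np.array([50, 60, 48, 70, 55, 55, 60])

# Pearson correlation: negative values indicate that higher
# elevations tend to go with lower high temperatures
r = np.corrcoef(elevation, temperature)[0, 1]
```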

Time-Series

A time-series plot is a graph showing the data measurements in chronological order, the data being quantitative data. For example, a time-series plot is used to show profits over the last 5 years. To create a time-series plot, the time always goes on the horizontal axis, and the other variable goes on the vertical axis. Then plot the ordered pairs and connect the dots. The purpose of a time-series graph is to look for trends over time. Caution, you must realize that the trend may not continue. Just because you see an increase, doesn’t mean the increase will continue forever. As an example, prior to 2007, many people noticed that housing prices were increasing. The belief at the time was that housing prices would continue to increase. However, the housing bubble burst in 2007, and many houses lost value, and haven’t recovered.

Example \(\PageIndex{3}\) Time-series plot

The following table tracks the weight of a dieter, where the time in months measures how long it has been since the person started the diet.

Time (months) 0 1 2 3 4 5
Weight (pounds) 200 195 192 193 190 187
Table \(\PageIndex{3}\): Data of Weights versus Time

Make a time-series plot of these data.

In R, the command would be:

variable1 <- c(type in data with commas in between values; this should be the time variable)
variable2 <- c(type in data with commas in between values)
plot(variable1, variable2, ylim=c(0, number over max), main="type in a title you want", xlab="type in a label for the horizontal axis", ylab="type in a label for the vertical axis")
lines(variable1, variable2) – connects the dots

For this example:
time <- c(0, 1, 2, 3, 4, 5)
weight <- c(200, 195, 192, 193, 190, 187)
plot(time, weight, ylim=c(0, 250), main="Weight over Time", xlab="Time (Months)", ylab="Weight (pounds)")
lines(time, weight)

[Graph 2.3.7: time-series plot of weight versus time]

Notice that over the 5 months, the weight appears to be decreasing, though the decrease is not large.

Be careful when making a graph. If you don’t start the vertical axis at 0, then the change can look much more dramatic than it really is. As an example, Graph 2.3.8 shows Graph 2.3.7 with a different scaling on the vertical axis. Notice that the decrease in weight looks much larger than it really is.

[Graph 2.3.8: the weight time-series with a truncated vertical axis]

Exercise \(\PageIndex{1}\)

80 79 89 74 73 67 79
93 70 70 76 88 83 73
81 79 80 85 79 80 79
58 93 94 74      
Table \(\PageIndex{4}\): Data of Test 1 Grades
67 67 76 47 85 70
87 76 80 72 84 98
84 64 65 82 81 81
88 74 87 83    
Table \(\PageIndex{5}\): Data of Test 1 Grades
Length of Metacarpal Height of Person
45 171
51 178
39 157
41 163
48 172
49 183
46 173
43 175
47 173
Table \(\PageIndex{6}\): Data of Metacarpal versus Height
Value Rental Value Rental Value Rental Value Rental
81000 6656 77000 4576 75000 7280 67500 6864
95000 7904 94000 8736 90000 6240 85000 7072
121000 12064 115000 7904 110000 7072 104000 7904
135000 8320 130000 9776 126000 6240 125000 7904
145000 8320 140000 9568 140000 9152 135000 7488
165000 13312 165000 8528 155000 7488 148000 8320
178000 11856 174000 10400 170000 9568 170000 12688
200000 12272 200000 10608 194000 11232 190000 8320
214000 8528 280000 10400 200000 10400 200000 8320
240000 10192 240000 12064 240000 11648 225000 12480
289000 11648 270000 12896 262000 10192 244500 11232
325000 12480 310000 12480 303000 12272 300000 12480
Table \(\PageIndex{7}\): Data of House Value versus Rental
Life Expectancy Fertility Rate Life Expectancy Fertility rate
77.2 1.7 72.3 3.9
55.4 5.8 76.0 1.5
69.9 2.2 66.0 4.2
76.4 2.1 5.9 5.2
75.0 1.8 54.4 6.8
78.2 2.0 62.9 4.7
73.0 2.6 78.3 2.1
70.8 2.8 72.1 2.9
82.6 1.4 80.7 1.4
68.9 2.6 74.2 2.5
81.0 1.5 73.3 1.5
54.2 6.9 67.1 2.4
Table \(\PageIndex{8}\): Data of Life Expectancy versus Fertility Rate
Prenatal Care (%) Health Expenditure (% of GDP)
47.9 9.6
54.6 3.7
93.7 5.2
84.7 5.2
100.0 10.0
42.5 4.7
96.4 4.8
77.1 6.0
58.3 5.4
95.4 4.8
78.0 4.1
93.3 6.0
93.3 9.5
93.7 6.8
89.8 6.1
Table \(\PageIndex{9}\): Data of Prenatal Care versus Health Expenditure
Year 1983 1984 1985 1986 1987 1988 1989 1990
Rate 4.31 4.42 4.52 4.35 4.39 4.21 3.40 3.61
Year 1991 1992 1993 1994 1995 1996 1997  
Rate 3.67 3.61 2.98 2.95 2.72 2.95 2.3  
Table \(\PageIndex{10}\): Data of Year versus Number of Deaths due to Firearms
Date Assets in Billions of AUD
Mar-2006 96.9
Jun-2006 107.4
Sep-2006 107.2
Dec-2006 116.2
Mar-2007 123.7
Jun-2007 134.0
Sep-2007 123.0
Dec-2007 93.2
Mar-2008 93.7
Jun-2008 105.6
Sep-2008 101.5
Dec-2008 158.8
Mar-2009 118.7
Jun-2009 111.9
Sep-2009 87.0
Dec-2009 86.1
Mar-2010 83.4
Jun-2010 85.7
Sep-2010 74.8
Dec-2010 76.0
Mar-2011 75.7
Jun-2011 75.9
Sep-2011 75.2
Dec-2011 87.9
Mar-2012 91.0
Jun-2012 90.1
Sep-2012 83.9
Dec-2012 95.8
Mar-2013 90.5
Table \(\PageIndex{11}\): Data of Date versus RBA Assets
Year CPI-U-RS1 index (December 1977=100) Year CPI-U-RS1 index (December 1977=100)
1947 37.5 1980 127.1
1948 40.5 1981 139.2
1949 40.0 1982 147.6
1950 40.5 1983 153.9
1951 43.7 1984 160.2
1952 44.5 1985 165.7
1953 44.8 1986 168.7
1954 45.2 1987 174.4
1955 45.0 1988 180.8
1956 45.7 1989 188.6
1957 47.2 1990 198.0
1958 48.5 1991 205.1
1959 48.9 1992 210.3
1960 49.7 1993 215.5
1961 50.2 1994 220.1
1962 50.7 1995 225.4
1963 51.4 1996 231.4
1964 52.1 1997 236.4
1965 52.9 1998 239.7
1966 54.4 1999 244.7
1967 56.1 2000 252.9
1968 58.3 2001 260.0
1969 60.9 2002 264.2
1970 63.9 2003 270.1
1971 66.7 2004 277.4
1972 68.7 2005 286.7
1973 73.0 2006 296.1
1974 80.3 2007 304.5
1975 86.9 2008 316.2
1976 91.9 2009 315.0
1977 97.7 2010 320.2
1978 104.4 2011 330.3
1979 114.4    
Table \(\PageIndex{12}\): Data of Time versus CPI
Year Median Income Year Median Income
1967 42,056 1990 49,950
1968 43,868 1991 48,516
1969 45,499 1992 48,117
1970 45,146 1993 47,884
1971 44,707 1994 48,418
1972 46,622 1995 49,935
1973 47,563 1996 50,661
1974 46,057 1997 51,704
1975 44,851 1998 53,582
1976 45,595 1999 54,932
1977 45,884 2000 54,841
1978 47,659 2001 53,646
1979 47,527 2002 53,019
1980 46,024 2003 52,973
1981 45,260 2004 52,788
1982 45,139 2005 53,371
1983 44,823 2006 53,768
1984 46,215 2007 54,489
1985 47,079 2008 52,546
1986 48,746 2009 52,195
1987 49,358 2010 50,831
1988 49,737 2011 50,054
1989 50,624    
Table \(\PageIndex{13}\): Data of Time versus Median Income


See solutions

Data Sources:

B1 assets of financial institutions. (2013, June 27). Retrieved from www.rba.gov.au/statistics/tables/xls/b01hist.xls

Benen, S. (2011, September 02). [Web log message]. Retrieved from http://www.washingtonmonthly.com/pol...edit031960.php

Capital and rental values of Auckland properties . (2013, September 26). Retrieved from http://www.statsci.org/data/oz/rentcap.html

Contraceptive use . (2013, October 9). Retrieved from http://www.prb.org/DataFinder/Topic/...gs.aspx?ind=35

Deaths from firearms . (2013, September 26). Retrieved from http://www.statsci.org/data/oz/firearms.html

DeNavas-Walt, C., Proctor, B., & Smith, J. U.S. Department of Commerce, U.S. Census Bureau. (2012). Income, poverty, and health insurance coverage in the United States: 2011 (P60-243). Retrieved from website: www.census.gov/prod/2012pubs/p60-243.pdf

Density of people in Africa . (2013, October 9). Retrieved from http://www.prb.org/DataFinder/Topic/...249,250,251,25 2,253,254,34227,255,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,27 2,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,294, 295,296,297,298,299,300,301,302,304,305,306,307,308

Department of Health and Human Services, ASPE. (2013). Health insurance marketplace premiums for 2014. Retrieved from website: aspe.hhs.gov/health/reports/2...b_premiumsland scape.pdf

Electricity usage . (2013, October 9). Retrieved from http://www.prb.org/DataFinder/Topic/...s.aspx?ind=162

Fertility rate. (2013, October 14). Retrieved from http://data.worldbank.org/indicator/SP.DYN.TFRT.IN

Fuel oil usage. (2013, October 9). Retrieved from http://www.prb.org/DataFinder/Topic/...s.aspx?ind=164

Gas usage. (2013, October 9). Retrieved from http://www.prb.org/DataFinder/Topic/...s.aspx?ind=165

Health expenditure. (2013, October 14). Retrieved from http://data.worldbank.org/indicator/SH.XPD.TOTL.ZS

Hinatov, M. U.S. Consumer Product Safety Commission, Directorate of Epidemiology. (2012). Incidents, deaths, and in-depth investigations associated with non-fire carbon monoxide from engine-driven generators and other engine-driven tools, 1999-2011. Retrieved from website: www.cpsc.gov/PageFiles/129857/cogenerators.pdf

Life expectancy at birth. (2013, October 14). Retrieved from http://data.worldbank.org/indicator/SP.DYN.LE00.IN

Median income of males. (2013, October 9). Retrieved from http://www.prb.org/DataFinder/Topic/...s.aspx?ind=137

Median income of males. (2013, October 9). Retrieved from http://www.prb.org/DataFinder/Topic/...s.aspx?ind=136

Prediction of height from metacarpal bone length . (2013, September 26). Retrieved from http://www.statsci.org/data/general/stature.html

Pregnant woman receiving prenatal care. (2013, October 14). Retrieved from http://data.worldbank.org/indicator/SH.STA.ANVC.ZS

United States unemployment. (2013, October 14). Retrieved from http://www.tradingeconomics.com/unit...mployment-rate

Weissmann, J. (2013, March 20). A truly devastating graph on state higher education spending. The Atlantic. Retrieved from http://www.theatlantic.com/business/...ending/274199/

Graphical summaries of data #

Many powerful approaches to data analysis communicate their findings via graphs. These are an important counterpart to data analysis approaches that communicate their findings via numbers or tables.

Here we will illustrate some of the most common approaches for graphical data analysis. Throughout this discussion, it is important to remember that graphical data analysis methods are subject to the same principles as non-graphical methods. A graph can be either informative or misleading, just like any other type of statistical result. To understand whether a graph is informative, we should consider the following:

Every graph should provide insight into the specific research question that is the overall goal of the data analysis.

The graph is constructed using a sample of data, but the purpose of the graph is to learn about the population that the sample represents.

What statistical principle or concept is the graph based on?

What are the theoretical properties of any numerical summaries that are shown in the graph?

Almost every statistical graphic conveys a statistical concept that can be defined in a non-graphical manner. Graphs may show associations, location, dispersion, tails, conditioning, or almost any other statistical feature of the data or population. Graphs make it easier for the viewer to digest such information, but when interpreting a graph it is always important to keep in mind the specific statistical concept on which the graph is based.

Statistical graphics have an aesthetic dimension that is usually not evident when presenting findings through, say, tables. Our goal here is to focus on the content of graphs, not their aesthetic properties. Very crude graphs that have deep content are much more informative than beautiful graphs that convey only superficial content. In recent years, the field of infographics has grown rapidly. There is no sharp line dividing infographics from statistical graphs, however in general, the former tend to convey relatively simple insights in an aesthetically engaging way, while the latter aim to convey deeper and more subtle insight, with less focus on presentation.

Challenges and limitations of graphs #

One of the main challenges in statistical graphics is to fit the greatest amount of useful information into a single graph, while allowing the graph to remain interpretable. More complex graphs can suffer from overplotting , in which the plot elements are so crowded on the page that they fall on top of each other. This can limit the legibility of plots formed from large datasets unless a great deal of preliminary summarization of the data is performed.

Another challenge that arises in graphing complex datasets is that most graphs are two-dimensional, so that they can be viewed on a screen (or printed on paper). Some graphing techniques extend to three dimensions, but many datasets have a natural dimensionality that is much greater than 2 or 3. A few methods for graphing work around this, but require more effort from the person viewing the graph.

Boxplots are a graphical representation of the distribution of a single quantitative variable. A boxplot is based on a set of quantiles calculated using a sample of data. Below is an example of a single boxplot, drawn horizontally, showing the distribution of income values based on a sample of 100 individuals.

The “box” in a boxplot (shaded blue above) spans from the 25th to the 75th percentiles of the data, with an additional line drawn cross-wise through the box at the median. “Whiskers” extend from either end of the box, and are intended to cover the range of the data, excluding “outliers”.

The concept of an outlier is extremely problematic and no generically useful definition of outliers has been proposed. For the purpose of drawing a boxplot, the most common convention is to plot the upper (right-most) whisker at the 75th percentile plus 1.5 times the IQR, or to the greatest data value less than this quantity. Analogously, the lower (left-most) whisker is drawn at the 25th percentile minus 1.5 times the IQR, or to the least data value greater than this quantity. Finally, individual points sometimes called “fliers” are drawn corresponding to any value that falls outside the range spanned by the whiskers. A single box-plot, as above, is often drawn horizontally, but may also be drawn vertically.

There are many alternative ways of defining the locations of the whiskers in a boxplot. The approach described above is the most common, and is chosen so that with “light tailed” distributions, well under 1% of the data will fall outside of the whiskers.

The boxplot above shows a right-skewed distribution. This is evident because the upper whisker is further from the box than the lower whisker. Also, within the box, the median is closer to the lower side of the box than to the upper side of the box. Overall, the lower quantiles are more compressed, and the upper quantiles are more spread out, which is a feature of right-skewed distributions.
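The whisker convention described above is easy to compute directly. A minimal Python sketch, using a small hypothetical right-skewed sample:

```python
import numpy as np

# Hypothetical right-skewed sample (e.g. incomes, in thousands)
data = np.array([12, 15, 18, 21, 22, 25, 27, 30, 34, 40, 48, 95], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box, per the convention described above
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Each whisker stops at the most extreme observation inside its fence
upper_whisker = data[data <= upper_fence].max()
lower_whisker = data[data >= lower_fence].min()

# "Fliers" are the individual points drawn beyond the whiskers
fliers = data[(data > upper_whisker) | (data < lower_whisker)]
```

For this sample the whiskers reach 12 and 48, and the value 95 is plotted individually as a flier.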

Side-by-side boxplots #

Boxplots are commonly used to compare distributions. A “side-by-side” or “grouped” boxplot is a collection of boxplots drawn for different subsets of data, plotted on the same axes. These subsets usually result from a stratification of the data, according to some stratifying factor that partially accounts for the heterogeneity within the population of interest. For example, below we consider boxplots showing the distribution of income, stratified by sex.

Histograms #

A histogram is a very familiar way to visualize quantitative data. A histogram is constructed by breaking the range of the values into bins and counting the number (or proportion) of observations that fall into each bin. The shape of a histogram shows visually how likely we are to observe a data value in each part of the range. We are more likely to observe values where the histogram bars are higher, and less likely to observe values where the histogram bars are lower.

Histograms closely resemble “bar charts”, but with the added statistical aspect that the goal is to capture the density at each possible point in the population. “Density” is a measure of how commonly we observe data “near”, rather than “at” a point. For example, the density of household incomes at 45,000 USD would not be the exact number or frequency of households with this income. Instead, it reflects the frequency of households that have an income near 45,000 USD.
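Bin counting, and the conversion from counts to a density scale, can be sketched as follows (the sample values are hypothetical):

```python
import numpy as np

# Hypothetical small sample of a quantitative variable
values = np.array([1.2, 1.9, 2.1, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6, 5.0])

# Break the range into equal-width bins and count observations per bin
counts, edges = np.histogram(values, bins=4, range=(1.0, 5.0))

# Density scale: divide each count by (sample size * bin width),
# so the bar areas sum to 1
density = counts / (counts.sum() * np.diff(edges))
```

Here the four bins receive 2, 3, 3, and 2 observations, and the density bars have total area 1.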

A histogram can be used to assess almost any property of a distribution. The common measures of location and dispersion can be judged from visual inspection of the histogram. As always, we should remember that features of the histogram may not always reflect features of the population from which the data were sampled. For example, a histogram may show two modes (i.e. is bimodal ) even when the underlying distribution only has one mode (i.e. is unimodal ). Moreover, the number of modes in a histogram can change as the bin width is varied.

Histograms are easy to communicate about, but may not be effective when working with small samples, where they can accentuate non-generalizable features of the sample (i.e. characteristics of the sample that are not present in the population). This is reflected in the following mathematical fact. For many statistics, if we wish to reduce the error relative to the population value of the statistic by a factor of two, we need to increase the sample size by a factor of four. In the case where we are aiming to estimate a density, in order to reduce the error by a factor of two, we need to increase the sample size by a factor of eight.

With a sufficiently large collection of representative data, the histogram should closely match the population’s probability density function (PDF). The PDF is usually a smooth curve, rather than a series of steps as in a histogram. This fact inspired the development of a modified version of a histogram that presents us with a smooth curve instead of a series of steps. This technique is called kernel density estimation ( KDE ). It produces graphs such as shown below.

Kernel density estimates may provide a somewhat more accurate estimation of the underlying density function compared to a histogram. But like a histogram, they can be unstable and produce artifacts. For example, the KDE above shows positive density for negative income values, even though all of the income values used to fit the KDE were positive (in some cases, income can take a negative value, but in this case no such values were present). More advanced KDE methods not used here can mitigate this issue.
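As a rough illustration of the idea (this is a plain fixed-bandwidth Gaussian KDE written from scratch, not one of the more advanced methods mentioned above; the sample and bandwidth are hypothetical):

```python
import numpy as np

def kde(x_grid, sample, bandwidth):
    """Basic Gaussian KDE: the average of Gaussian bumps centred at each observation."""
    z = (x_grid[:, None] - sample[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical positive-valued sample (e.g. incomes, rescaled)
sample = np.array([1.0, 2.0, 2.5, 3.0, 4.0])
grid = np.linspace(-5.0, 10.0, 601)
density = kde(grid, sample, bandwidth=0.5)

# The estimate is smooth and integrates to (approximately) 1 ...
area = density.sum() * (grid[1] - grid[0])

# ... but, as noted above, it can place positive density on negative
# values even though every sample value is positive
negative_region = density[grid < 0]
```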

One advantage of using a KDE rather than a histogram is that it is easier to overlay multiple KDEs on the same axes for comparison without too much overplotting. This might allow us to compare, say, the distributions of female and male incomes as follows.

Quantile plots #

A quantile plot is a plot of the pairs \((p, q_p)\) , where \(q_p\) is the p’th quantile of a collection of quantitative values. Since \(p\) can be any real number between 0 and 1, the graph of these pairs constitutes a function. By construction, this must be a non-decreasing function. A quantile plot contains essentially the same information as a histogram, but is represented in a very different way. Note that unlike the histogram, for which the bin width is a parameter that must be selected, there is no such parameter in the quantile plot. Arguably, the quantile plot is a more stable and informative summary of a sample, especially if the sample size is moderate. However most people are more comfortable interpreting histograms than quantile functions.
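Computing the pairs that make up a quantile plot is a one-liner; a minimal sketch with a hypothetical sample:

```python
import numpy as np

# Hypothetical sample of quantitative values
sample = np.array([3, 1, 4, 1, 5, 9, 2, 6])

# The quantile plot graphs the pairs (p, q_p) for p between 0 and 1
p = np.linspace(0.0, 1.0, 101)
q = np.quantile(sample, p)

# By construction the quantile function is non-decreasing,
# running from the sample minimum to the sample maximum
assert np.all(np.diff(q) >= 0)
```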

As an example, the following plot shows simulated systolic blood pressure values for a sample of females and a sample of males. In this case, at every probability point \(p\) , the blood pressure quantile for males is greater than the blood pressure quantile for females, indicating that male blood pressure is “stochastically greater” than female blood pressure.

Below is another example that shows two quantile functions, but in this case the quantile functions cross. As a result, there is no “stochastic ordering” between the data for females and for males. Also note that the quantile curve for females is steeper than the curve for males, indicating that the female blood pressure values are more dispersed than those for the males.

Quantile-quantile plots #

A quantile-quantile plot , or QQ plot , is a plot based on quantiles that is used to compare two distributions. Recall that a quantile plot plots the pairs \((p, q_p)\) for one sample. A QQ plot plots the pairs \((q^{(1)}_p, q^{(2)}_p)\) , where \(q^{(1)}_p\) are the quantiles for the first sample, and \(q^{(2)}_p\) are the quantiles for the second sample. In a QQ-plot, the value of p is “implicit” – each point on the graph corresponds to a specific value of p, but you cannot see what this value is by inspecting the graph.

As an example, suppose we are comparing the number of minutes of sleep during one night for teenagers and adults. This might give us the following QQ-plot:

The above QQ-plot shows us that teenagers tend to sleep longer than adults, and this is especially true at the upper end of the range. The QQ-plot approximately passes through the point (600, 800), meaning that for some probability p, 600 is the p’th quantile for adults and 800 is the p’th quantile for teenagers.

The slope of the curve in the QQ-plot reflects the relative levels of dispersion in the two distributions being compared. Since the slope of the curve in the above QQ-plot is greater than that of the diagonal reference line, it follows that the values plotted on the vertical axis (teenager’s values) are more dispersed than the values plotted on the horizontal axis (adult’s values).

An important property of a QQ-plot is that if the plot shows a linear relationship between the quantiles, then the two distributions are related via a location/scale transformation . That is, there is a linear function \(a + bx\) that maps one distribution to the other. In the example above, there is a substantial amount of curvature in the graph, so it does not seem to be the case that the sleep durations for adults and teenagers are related via a location/scale transformation.
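The location/scale property can be checked numerically. In this sketch (both samples are hypothetical), the second sample is an exact linear transform of the first, so the QQ points fall on a straight line:

```python
import numpy as np

# Hypothetical samples: y is a pure location/scale transform of x
# (a + b*x with a = 3, b = 2)
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
y = 3 + 2 * x

# A QQ plot pairs the p'th quantile of one sample with the p'th
# quantile of the other; p itself never appears on the axes
p = np.linspace(0.05, 0.95, 19)
qq_points = np.column_stack([np.quantile(x, p), np.quantile(y, p)])

# Quantiles respect increasing linear maps, so the QQ points fall
# exactly on the line q_y = 3 + 2 * q_x
assert np.allclose(qq_points[:, 1], 3 + 2 * qq_points[:, 0])
```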

Dot plots #

Dot plots display quantitative data that are stratified into groups. One axis of the plot is used to display the quantitative measure, and the other axis is used to separate the results for different groups. A series of parallel “guide lines” are used to show which points belong to each group. Dot plots are often used to display a collection of numerical summary statistics in visual form. Sometimes people say that dot plots are used to “convert tables into graphs”. Due to overplotting, dot plots are less commonly used to show raw data. The example below shows how dot plots can be used to display the median age stratified by sex, for people living in each of eleven countries.

The plot above shows that the median age for females is greater than the median age for males in every country. This is mainly due to females having longer life expectancies than males. We also see that some countries have much lower median ages for both sexes compared to other countries. Countries that have recently had high birth rates, such as Ethiopia and Nigeria, tend to have much lower median ages than countries with lower birth rates, such as Japan.

Scatterplots #

A scatterplot is a very widely-used method for visualizing bivariate data. They have many uses, but the most relevant for us is to plot the joint (empirical) distribution of two quantitative values. As an example, suppose that we observe paired data values giving the annual minimum and annual maximum temperature at a location. We could view these data with a scatterplot, placing, say, the minimum temperature value on the horizontal (x) axis, and the maximum temperature value on the vertical (y) axis. The number of points is the sample size, here being the number of locations for which temperature data are available. A possible graph of this type is shown below.

Several characteristics of the relationship between minimum and maximum temperature are evident from the plot above. The maximum temperature at each location is at least as large as the minimum temperature. There is a positive association, in which locations with a lower minimum temperature tend to have a lower maximum temperature compared to places with a higher minimum temperature, but there is a lot of scatter around this trend. Warmer places tend to have a smaller range between their minimum and maximum temperatures. Concretely, locations on the equator and at low elevation, such as Singapore, have relatively constant temperature throughout the year. Locations near the center of large continents, like Winnipeg, Canada, can have extremely cold winters and also rather hot summers. Coastal regions that are far from the equator, such as Dublin, Ireland, have mild winters and cool summers.

To aid in interpreting a scatterplot, it is useful to plot a smooth curve that runs through the center of the data. This is called scatterplot smoothing , and can be accomplished with several algorithms, one of which is known as lowess . The population analogue of a scatterplot smooth is the conditional mean , or conditional expectation , denoted \(E[Y|X=x]\) , for the conditional mean of \(Y\) given \(X\) . The conditional mean is a function of \(x\) , and can be evaluated at any point \(x\) in the domain of \(X\) . The conditional mean is (roughly speaking), the average of all values of \(Y\) whose corresponding value of \(X\) is near \(x\) .

The plot below adds the estimated conditional mean (orange curve) to the scatterplot of temperature data discussed above. The conditional mean curve is increasing, showing that, as noted above, a location with lower annual minimum temperature tends on average to have a lower annual maximum temperature (relative to other locations).
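Lowess itself involves weighted local regression; as a minimal illustration of the conditional-mean idea only (not the lowess algorithm), a crude binned estimate can be sketched as follows, with hypothetical data:

```python
import numpy as np

def binned_conditional_mean(x, y, n_bins=5):
    """Crude estimate of E[Y|X=x]: average the y values whose x falls
    in each of n_bins equal-width bins (assumes every bin contains data)."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.array([y[idx == b].mean() for b in range(n_bins)])
    return centers, means

# Hypothetical data with a clear increasing trend
x = np.arange(10, dtype=float)
y = 2 * x
centers, means = binned_conditional_mean(x, y)
```

Plotting `means` against `centers` gives a rough increasing curve through the middle of the point cloud, which is what a scatterplot smooth estimates more carefully.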

Time series plots #

Some data have a serial structure, meaning that the values are observed with an ordering. Very often, such observations are made over time, which gives us time series or longitudinal data. Sometimes we observe a single time series over a long period of time, such as the value of a commodity in a market recorded every day over many years. Other times, we observe many short time series recorded irregularly. We may plot these time series together, leading to what is sometimes called a “spaghetti plot”. For example, in a study of human growth, we may observe measurements of the body weight of research subjects at various ages, giving us the spaghetti plot below:

Parallel coordinate plots #

Scatterplots in the plane are limited to two dimensions. Various techniques have been developed to overcome this limitation, one of which is the parallel coordinate plot . A parallel coordinate plot places the coordinate axes for the multiple dimensions as parallel lines, rather than as perpendicular lines. Using parallel lines means that data for far more than two or three variables can be placed on a single page.

Below is an example of a parallel coordinates plot, showing four attributes of a set of ten countries. A scatterplot of these points would live in four-dimensional space, which is quite challenging to visualize directly. Note that the attributes are converted to Z-scores, which is common in a parallel coordinates plot when the variables being plotted fall in very different ranges. The plot shows us that the life expectancies for females and for males are quite similar – the country with the highest life expectancy for females also has the highest life expectancy for males, and the country with the lowest life expectancy for females also has the lowest life expectancy for males. There is also a substantial positive relationship between the economic status of a country, as measured by its gross domestic product (GDP) and life expectancy. However no relationship is evident between GDP and population, or between either of the life expectancy variables and population.

Mosaic plots #

The graphs above primarily use quantitative data. A mosaic plot is a plot that is used with nominal data. Specifically, mosaic plots are used when the units of analysis are cross-classified according to two nominal factors. In the example below, people with cancer are cross-classified by their biological sex and by the type of cancer that they have:

The width of each box in the mosaic plot corresponds to the relative overall prevalence of the corresponding cancer type. The heights of the boxes correspond to the sex-specific prevalences. Based on this graph, we see that digestive, lung, and breast cancers are much more common than, say, oral and endocrine cancers. The mosaic plot also shows us that while breast and endocrine cancers are more common in females, the other cancer types are more common in males.

An important property of a mosaic plot is that the area of each box is proportional to the number of units that fall into the box. Thus, we can see that the area of the female breast cancer box is larger than the combined areas of the female and male lung cancer boxes; there are more cases of breast cancer in females than cases of lung cancer in both sexes combined.


Perception-Inspired Graph Convolution for Music Understanding Tasks

This article discusses musgconv, a perception-inspired graph convolution block for symbolic musical applications.

Emmanouil Karystinaios


Towards Data Science

Introduction

In the field of Music Information Research (MIR), the challenge of understanding and processing musical scores has continually attracted new methods and approaches. Most recently, many graph-based techniques have been proposed to target music understanding tasks such as voice separation, cadence detection, composer classification, and Roman numeral analysis.

This blog post covers one of my recent papers in which I introduced a new graph convolutional block, called MusGConv , designed specifically for processing music score data. MusGConv takes advantage of music perceptual principles to improve the efficiency and the performance of graph convolution in Graph Neural Networks applied to music understanding tasks.

Understanding the Problem

Traditional approaches in MIR often rely on audio or symbolic representations of music. While audio captures the intensity of sound waves over time, symbolic representations like MIDI files or musical scores encode discrete musical events. Symbolic representations are particularly valuable as they provide higher-level information essential for tasks such as music analysis and generation.

However, existing techniques based on symbolic music representations often borrow from computer vision (CV) or natural language processing (NLP) methodologies: for instance, representing music as a “pianoroll” in a matrix format and treating it similarly to an image, or representing music as a series of tokens and treating it with sequential models or transformers. These approaches, though effective, can fall short of fully capturing the complex, multi-dimensional nature of music, which includes hierarchical note relations and intricate pitch-temporal relationships. Some recent approaches instead model the musical score as a graph and apply Graph Neural Networks to solve various tasks.

The Musical Score as a Graph

The fundamental idea of GNN-based approaches to musical scores is to model a musical score as a graph where notes are the vertices and edges are built from the temporal relations between the notes. To create a graph from a musical score we can consider four types of edges (see Figure below for a visualization of the graph on the score):

  • onset edges: connect notes that share the same onset;
  • consecutive edges (or next edges): connect a note x to a note y if the offset of x corresponds to the onset of y;
  • during edges: connect a note x to a note y if the onset of y falls within the onset and offset of x;
  • rest edges (or silence edges): connect the last notes before a rest to the first ones after it.

A GNN can then operate on the graph built from the notes and these four types of relations.
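The four edge types above can be sketched in plain Python. This is a simplified illustration, not the paper's actual implementation; notes are represented as hypothetical (onset, duration) pairs.

```python
def build_score_graph(notes):
    """notes: list of (onset, duration) pairs; returns edge lists per type."""
    edges = {"onset": [], "consecutive": [], "during": [], "rest": []}
    offsets = [on + dur for on, dur in notes]
    for i, (on_i, _) in enumerate(notes):
        for j, (on_j, _) in enumerate(notes):
            if i == j:
                continue
            if on_i == on_j and i < j:
                edges["onset"].append((i, j))        # same onset
            if offsets[i] == on_j:
                edges["consecutive"].append((i, j))  # offset of i is onset of j
            if on_i < on_j < offsets[i]:
                edges["during"].append((i, j))       # j starts while i sounds
    # rest edges: if a silence precedes note j, connect the last-ending
    # notes before that silence to j
    for j, (on_j, _) in enumerate(notes):
        prev = [offsets[k] for k, (on_k, _) in enumerate(notes) if on_k < on_j]
        if prev and max(prev) < on_j:
            for i, (on_i, _) in enumerate(notes):
                if on_i < on_j and offsets[i] == max(prev):
                    edges["rest"].append((i, j))
    return edges
```

For example, with four notes where two share an onset and a rest separates the last note from the rest, the function recovers exactly one edge of each relevant type.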

Introducing MusGConv

MusGConv is designed to leverage music score graphs and enhance them by incorporating principles of music perception into the graph convolution process. It focuses on two fundamental dimensions of music: pitch and rhythm, considering both their relative and absolute representations.

Absolute representations refer to features that can be attributed to each note individually, such as the note’s pitch or spelling and its duration. Relative features, on the other hand, are computed between pairs of notes: the musical interval between them, their onset difference (i.e. the difference between the times at which they occur), and so on.

Key Features of MusGConv

  • Edge Feature Computation: MusGConv computes edge features based on the distances between notes in terms of onset, duration, and pitch. The edge features can be normalized to make them more effective for Neural Network computations.
  • Relative and Absolute Representations: By considering both relative features (distances between pitches as edge features) and absolute values (actual pitch and timing as node features), MusGConv can adapt and use whichever representation is more relevant depending on the occasion.
  • Integration with Graph Neural Networks: The MusGConv block integrates easily with existing GNN architectures at almost no additional computational cost and can be used to improve music understanding tasks such as voice separation, harmonic analysis, cadence detection, or composer identification.

The importance and coexistence of the relative and absolute representations can be understood from a transpositional perspective in music. Imagine the same musical content transposed: the intervallic relations between notes stay the same, but the pitch of each note is altered.

Understanding Message Passing in Graph Neural Networks (GNNs)

To fully understand the inner workings of the MusGConv convolution block it is important to first explain the principles of Message Passing.

What is Message Passing?

In the context of GNNs, message passing is a process in which vertices within a graph exchange information with their neighbors to update their own representations. This exchange allows each node to gather contextual information from the graph, which is then used for predictive tasks.

The message passing process is defined by the following steps:

  • Initialization: Each node is assigned a feature vector, which can include some important properties. For example, in a musical score this could include pitch, duration, and onset time for each node/note.
  • Message Generation: Each node generates a message to send to its neighbors. The message typically includes the node’s current feature vector and any edge features that describe the relationship between the nodes. A message can be, for example, a linear transformation of the neighbor’s node features.
  • Message Aggregation: Each node collects messages from its neighbors. The aggregation function is usually a permutation-invariant function such as sum, mean, or max, and it combines these messages into a single vector, ensuring that the node captures information from its entire neighborhood.
  • Node Update: The aggregated message is used to update the node’s feature vector. This update often involves applying a neural network layer (like a fully connected layer) followed by a non-linear activation function (such as ReLU).
  • Iteration: Steps 2–4 are repeated for a specified number of iterations or layers, allowing information to propagate through the graph. With each iteration, nodes incorporate information from progressively larger neighborhoods.
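The steps above can be condensed into a toy example in plain Python. There are no learned weights here: the "message" is just the neighbor's feature vector and the update is a plain sum, which is enough to show the mechanics.

```python
def message_passing_step(features, edges):
    """One round of message passing.

    features: {node: list of floats}; edges: list of (src, dst) pairs.
    """
    dim = len(next(iter(features.values())))
    aggregated = {v: [0.0] * dim for v in features}
    for src, dst in edges:                    # message generation + sum aggregation
        for k in range(dim):
            aggregated[dst][k] += features[src][k]
    # node update: combine own features with the aggregated messages
    # (a real GNN would apply a learned layer and a non-linearity here)
    return {v: [f + a for f, a in zip(features[v], aggregated[v])]
            for v in features}
```

Applying the step repeatedly lets information from progressively larger neighborhoods reach each node, exactly as described in the iteration step.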

Message Passing in MusGConv

MusGConv alters the standard message passing process mainly by incorporating both absolute features as node features and relative musical features as edge features. This design is tailored to fit the nature of musical data.

The MusGConv convolution is defined by the following steps:

  • Edge Feature Computation: In MusGConv, edge features are computed as the differences between notes in terms of onset, duration, and pitch. Additionally, pitch-class intervals (distances between pitches without considering the octave) are included, providing a reductive but effective way to quantify music intervals.
  • Message Computation: The message within MusGConv includes not only the source node’s current feature vector but also the aforementioned edge features from the source to the destination node, allowing the network to leverage both absolute and relative information about the neighbors during message passing.
  • Aggregation and Update: MusGConv uses sum as the aggregation function; the current node representation is then concatenated with the sum of its neighbor messages.

By designing the message passing mechanism in this way, MusGConv attempts to preserve the relative perceptual properties of music (such as intervals and rhythms), leading to more meaningful representations of musical data.

Should edge features be absent or deliberately not provided, MusGConv computes the edge features between two nodes as the absolute difference between their node features. The version of MusGConv with the edge features is named MusGConv(+EF) in the experiments.
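A simplified sketch of the MusGConv-style message computation in plain Python. This is illustrative only: the actual block also applies learned transformations and normalization, which are omitted here.

```python
def edge_features(src, dst):
    """src/dst: dicts with 'onset', 'duration', 'pitch' (MIDI number)."""
    return [dst["onset"] - src["onset"],
            dst["duration"] - src["duration"],
            dst["pitch"] - src["pitch"],
            (dst["pitch"] - src["pitch"]) % 12]   # pitch-class interval

def musgconv_messages(notes, edges):
    """For every node, sum messages of the form [src features ++ edge features]."""
    msgs = {i: [0.0] * 7 for i in range(len(notes))}
    for s, d in edges:
        # message = absolute features of the source + relative edge features
        m = ([notes[s]["onset"], notes[s]["duration"], notes[s]["pitch"]]
             + edge_features(notes[s], notes[d]))
        msgs[d] = [a + b for a, b in zip(msgs[d], m)]
    return msgs
```

Note how each message carries both the absolute view (the source note itself) and the relative view (the differences), mirroring the design described above.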

Applications and Experiments

To demonstrate the potential of MusGConv, I discuss below the tasks and experiments conducted in the paper. All models, independent of the task, are designed with the pipeline shown in the figure below. When MusGConv is employed, the GNN blocks are replaced by MusGConv blocks.

I decided to apply MusGConv to four tasks: voice separation, composer classification, Roman numeral analysis, and cadence detection. Each of these tasks falls into a different category from a graph learning perspective: voice separation is a link prediction task, composer classification is a global classification task, cadence detection is a node classification task, and Roman numeral analysis can be viewed as a subgraph classification task. Therefore, we are exploring the suitability of MusGConv not only from a musical analysis perspective but throughout the spectrum of the graph deep learning task taxonomy.

Voice Separation

Voice separation is the detection of individual monophonic streams within a polyphonic music excerpt. Previous methods have employed GNNs to solve this task. From a GNN perspective, voice separation can be viewed as a link prediction task, i.e. for every pair of notes we predict whether they are connected by an edge or not. The product of the link prediction process should be a graph in which consecutive notes in the same voice are connected; the voices are then the connected components of the predicted graph. I point readers to this paper for more information on voice separation using GNNs.
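Recovering voices from the predicted edges can be sketched as a connected-components computation. A minimal union-find over the predicted note pairs (plain Python, illustrative only):

```python
def voices_from_edges(n_notes, predicted_edges):
    """Group note indices into voices = connected components of the graph."""
    parent = list(range(n_notes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path compression
            x = parent[x]
        return x

    for a, b in predicted_edges:               # union the endpoints of each edge
        parent[find(a)] = find(b)
    groups = {}
    for note in range(n_notes):
        groups.setdefault(find(note), []).append(note)
    return sorted(groups.values())
```

For instance, predicted edges (0,2), (2,4), (1,3) over five notes yield two voices: notes {0, 2, 4} and notes {1, 3}.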

For voice separation, the pipeline of the figure above applies to the GNN encoder part of the architecture, and link prediction takes place in the task-specific module. To use MusGConv it is sufficient to replace the convolution blocks of the GNN encoder with MusGConv blocks. This simple substitution results in more accurate predictions with fewer mistakes.

Since interpreting deep learning systems is not trivial, it is not easy to pinpoint the reason for the improved performance. From a musical perspective, consecutive notes in the same voice tend to have small relative pitch differences, and the design of MusGConv highlights exactly these differences through its relative edge features. However, I should also say, from individual observations, that music does not strictly follow any rules.

Composer Classification

Composer classification is the process of identifying a composer from a music excerpt. Previous GNN-based approaches for this task receive a score graph as input, similarly to the pipeline shown above, and include a global pooling layer that collapses the graph of the music excerpt to a single vector. Classification is then applied to that vector, where the classes are the predefined composers.

Yet again, MusGConv is easy to use by replacing the GNN convolutional blocks. In the experiments, using MusGConv was indeed very beneficial for this task. My intuition is that relative features, in combination with the absolute ones, give better insights into compositional style.

Roman Numeral Analysis

Roman numeral analysis is a method for harmonic analysis in which chords are represented as Roman numerals. Predicting Roman numerals is a fairly complex task. Previous architectures used a mixture of GNNs and sequential models. Additionally, Roman numeral analysis is a multi-task classification problem: a Roman numeral is typically broken down into individual simpler tasks in order to reduce the vocabulary of unique Roman numeral classes. Finally, the graph-based architecture for Roman numeral analysis also includes an onset contraction layer after the graph convolution that transforms the graph into an ordered sequence. This layer contracts groups of notes that occur at the same time, and each group is assigned the same label during classification; therefore, the task can be viewed as subgraph classification. A full explication of this model would merit its own post, so I suggest reading the paper for more insights.

Nevertheless, the general graph pipeline in the figure is still applicable: the sequential models, the multi-task classification process, and the onset contraction module all belong to the task-specific box. However, replacing the graph convolutional blocks with MusGConv blocks does not seem to have an effect on this task and architecture. I attribute this to the fact that the task and the model architecture are simply too complex.

Cadence Detection

Finally, let’s discuss cadence detection. Detecting cadences can be viewed as similar to detecting phrase endings, and it is an important aspect of music analysis. Previous methods for cadence detection employed an encoder-decoder GNN architecture. Each note, which by now we know corresponds to one node in the graph, is classified as being a cadence note or not. The cadence detection task includes many peculiarities, such as very heavy class imbalance as well as annotation ambiguities. If you are interested, I again suggest checking out this paper.

Using MusGConv convolution in the encoder can be beneficial for detecting cadences. I believe that the combination of relative and absolute features in the design of MusGConv can capture voice-leading patterns that often occur around cadences.

Results and Evaluation

Extensive experiments have shown that MusGConv can outperform state-of-the-art models across the aforementioned music understanding tasks. The table below summarizes the improvements:

However soulless a table can be, I prefer not to go into any more detail, in the spirit of keeping this blog post lively and conversational. Therefore, I invite you to check out the original paper for more details on the results and datasets.

Summary and Discussion

MusGConv is a graph convolutional block for music. It offers a simple perception-inspired approach to graph convolution that results in performance improvements when GNNs are applied to music understanding tasks. Its simplicity is the key to its effectiveness. In some tasks it is very beneficial, in some others not so much. The inductive bias of combining relative and absolute features in music is a neat trick that can improve your GNN results, but my advice is to always take it with a pinch of salt. Try out MusGConv by all means, but do not forget about all the other graph convolutional block possibilities.

If you are interested in trying MusGConv, the code and models are available on GitHub.

Notes and Acknowledgments

All images in this post are by the author. I would like to thank Francesco Foscarin, my co-author on the original paper, for his contributions to this work.

Written by Emmanouil Karystinaios

Ph.D. Student at Johannes Kepler University

C# Corner

Programming in Practice - GUI - XAML - Description of the User Interface

  • Mariusz Postol
  • Jul 09, 2024

This article concerns selected issues related to the representation of process information in graphical form to develop a comprehensive User Interface. It presents XAML Domain-Specific Language as a description of the user interface.

Introduction

In this article, we continue the series dedicated to discussing selected issues related to the representation of process information in graphical form. The main goal is to address selected topics in the context of graphics, which is used as a kind of control panel for the business process. It is the third article in a series related to GUI development. If you are interested in this topic, you may also want to check out the previous articles:

  • Programming in Practice - Graphical User Interface (GUI)
  • Programming in Practice - GUI - MVVM Program Design Pattern

The discussion is backed by the example code gathered in the GitHub repository. To follow the discussion, open the ExDataManagement.sln solution in MS Visual Studio(TM). All examples are available under the 5.13-Juliet tag and have been added to the GraphicalData folder.

In This Article

  • XML Meaning
  • Partial class, conversion of XAML to CSharp, XAML semantics, rendering types
  • Bindings - User Interface Interoperability
  • DataContext
  • INotifyPropertyChanged
  • Program bootstrap

This article is a contribution to the Programming in Practice External Data topics. A sample program backs all topics.

An image is a composition of colored pixels. They must be composed in such a way as to represent selected process information, i.e. its state or behavior. Similarly to the case of data residing in memory, which we do not process by directly referring to their binary representation, we do not create a Graphical User Interface (GUI for short) by laboriously assembling pixels into a coherent composition. Moreover, the GUI is a dashboard controlling the process, so it must also behave dynamically, including enabling data entry and triggering commands.

In a computer-centric environment, generating such graphics requires a formal description. In this article, a dedicated domain-specific language called Extensible Application Markup Language (XAML for short) is examined. By design, it is used to formally describe what we see on the screen. A new language may sound disturbing, especially since learning this language is beyond the scope of this publication. Fortunately, in-depth knowledge of it is not required to understand any of the topics in question. The main goal is to examine selected topics related to generating a graphical user interface from its formal description, which we programmers can then integrate into the entire program.

However, how do we ensure the appropriate level of abstraction, i.e. hide the details related to rendering the image, without losing the ability to keep it under control? As usual, for our considerations to be based on practical examples, we must use a specific technology; I chose the Windows Presentation Foundation (WPF). Technology here refers to the tools, techniques, and processes used to design, develop, test, and maintain software systems, encompassing programming languages, development tools, frameworks and libraries, best-practice rules, patterns, and concepts. Still, I will try to ensure that we do not lose the generality of the considerations regardless of this selection. An important component of this technology is the XAML language, which we will use to achieve an appropriate level of abstraction. Hopefully, we will stay as close as possible to the practice of using the CSharp language to deploy a Graphical User Interface.

XML-Based Application Markup Language  

Previously we described how to use the independent Blend program to design the UI appearance. After finishing work in Blend, we can return to creating the program text, i.e. return to Visual Studio. Blend is an independent program that can be executed using the operating system interface, including the file browser context menu. It remains independent provided that the results of its work can be uploaded to the repository as an integrated part of the entire program, with the history of its changes tracked. This is only possible if its output is text - a demand that today must be followed without compromise. This is the reason why graphic formats such as GIF, JPG, and PowerPoint files, to name only a few, are generally a bad idea for describing the appearance of the GUI.

Let's see how this postulate is implemented in the proposed scenario. After returning to Visual Studio, we notice that one of the files has changed. After opening it in the editor, we see that it is a file with XML syntax, i.e. a text file, although a corresponding image is also displayed. Let's close the image, because we should focus on the text itself; it should be noted, however, that the image-text relationship exists. Going to the folder where this file is located, we could analyze its changes, but I suggest not wasting time examining the changes in the file itself. It is better to spend this time understanding the content and role of this document as a part of our program. So let's go back to Visual Studio.

Probably the first surprise is that instead of CSharp we have XML. There are at least two reasons for this.

  • The first is that the graphics rendering process is not related to the implementation of algorithms related to the process in the CSharp language. In other words, it is a data-centric process. So the first reason is the portability of the work result.
  • The second reason is related to the use of the Blend editor, i.e. an independent software tool. Let me stress that the XML standard was created as a language intended for exchanging data between programs, i.e. for application integration. Here we see how it works in practice for Blend and Visual Studio. Blend and Visual Studio are two independent programs whose functionality is partially compatible.

From the point of view of graphic design, the fact that we are dealing with XML should not worry us much. All that is needed is for people who know colors and shapes to give us the generated file, which we will attach to the program, and Visual Studio will do the rest. Unfortunately, this approach is too good to be true. Sooner or later - and, as we can guess, rather sooner - we have to start talking about integrating the image with program data and behavior, which is what we are paid for. However, we define data, i.e. sets of allowed values and operations performed on them, using types, so we need to start talking about types. Hence we must learn more about the meaning of this XML document.

XML Compilation Process  

Further examination of XML documents may start by noticing a seemingly trivial fact: the XML file is coupled with another file with the extension .cs. After opening it, we can recognize that it is CSharp text. Moreover, we see the word partial in the header of a class, so we are dealing with a partial definition of a type; maybe these two files create one definition. This only makes sense if the parts are written in the same language, with the same syntax and semantics. In the case under consideration, this is not met: trying to merge text documents compliant with different languages must lead to a result that is not compliant with any language. Our suspicions are confirmed, because, as we can see, the first element of this XML file contains the class attribute with the name of the coupled partial class.

Therefore, we can consider it very likely that a document written in a certain language based on XML syntax is converted to the CSharp language. After this conversion, the two parts can be merged into one unified class definition, and we can return to the well-known world of programming in CSharp. We call this new language XAML. According to the scenario presented here, we do not need to know this language - and that would be true as long as only a static image is to be created. However, we need to bring it to life, i.e. visualize the process state and behavior: display process data, enable data editing, and respond to user commands. We can be reassured by the fact that, in addition to the XAML part, we have a part in CSharp, called code-behind. Additionally, if the compiler can convert XAML to CSharp, maybe we can write everything in CSharp right away. The answer to the question of whether it is possible not to use XAML is positive, so the temptation is great. Unfortunately, this approach is costly. Before starting the cost estimation, we need to understand where the costs come from, but remember that we have three options: only Blend, only CSharp, or some combination of them.

To estimate the previously mentioned costs of converting XAML to CSharp, and to better understand the mechanisms of the environment, we need to look at what the compiler does based on the analysis of the program text. Let's do a short analysis without going into details. In the class constructor, we find a call to the InitializeComponent method, whose definition - at first glance - is not present anywhere in the file. Anyway, let's launch the program with a breakpoint just before the InitializeComponent call. It works, so after breaking execution we can select "Step Into" from the Debug menu to enter the method. We can see that the compiler generates this text automatically, but also that it does not contain a simple conversion of the XAML text to CSharp; instead, it passes the path of the XAML file to the LoadComponent method.

The implementation of this method is provided by the library, but from its description we can learn that it creates all relevant objects using reflection. Reflection is an advanced technique, and this is where the costs come from. Without reflection, an error-free conversion of XAML to CSharp is generally impractical or even impossible.

The syntax and semantics of XML defined by the specification are not sufficient to explain the meaning of the document. Let's try to explain what the word Grid means in a snippet of XAML text taken from an example in the repository. From the context menu, we can go to the definition of this identifier and see that an additional tab opens with the definition of a class of the same name. There is a parameter-less constructor for this class, which lets us guess the meaning of this XML element: call the parameter-less constructor and, consequently, create and initialize an object of this class. Analyzing the subsequent elements and attributes of this XML file, we see that they refer to properties of this class.

  • To put it simply, rendering is an activity of creating a composition of pixels on the screen following some formal description - in our case, it is turning text into a living image. Since we compose pixels on the screen, we can only talk about the program execution time. In the case of object-oriented programming, this formal description existing during program execution must be a set of objects connected in a structure, i.e. a graph. Objects are instantiated based on reference types. Therefore, the types that we will use to describe the image must have a common feature, namely an assigned shape. Therefore, the entire image must be a composition of typical shapes that enable the implementation of two additional functions, such as entering data and executing commands. Consequently, these shapes must also be adaptable to current needs. All this can be achieved thanks to the polymorphism paradigm and properties of types.
  • So let's go back to the XAML file. We can recognize it as a formal description of how to instantiate types and compose an interactive shape on the computer screen. We now know that the objects we create must have a common feature, namely, that they can be rendered. If an object is created, what should we do with the reference to it? For example, we create an object based on the definition of the Grid class; if nothing happens after instantiation, the garbage collector will immediately delete and release it. Therefore, let us assume that each object created in compliance with the hierarchy of elements of the XML document is added to a collection of internal objects of its parent. In such a case, the mentioned Grid object would be added to the MainWindow class; MainWindow itself is not a collection, but it inherits from the Window class, which may already be or contain such a collection. As a result, a tree of objects is created, the root of which - the trunk - is the MainWindow class, a partial class inheriting from Window.
  • A systematic discussion of the XAML language is a topic for an independent examination. Let's assume we get an XAML document from the work of aesthetics, ergonomics, and business process specialists. Without going into the details of this file, we can notice that the image created on the screen is also tree-like and consists of images that are further composed of subsequent images. In our example, the window is a kind of array, in which cells contain a list, keys, text fields, etc. In other words, each object we have created is rendered on the screen, i.e. each class formally describing this object must have an associated appearance, so the rules for creating a certain pixel composition. These classes are commonly called controls. So, without going into details, a control is a class definition that implements functionality reproducing a certain shape and behavior on the screen.
  • In other words, any control is a type that encapsulates user interface functionality and is used in client-side applications. This type has associated shape and responsibility to be used on the graphical user interface. The Control is a base class used in .NET applications, and the MSDN documentation explains it in detail. A bunch of derived classes inheriting from this class have been added to the GUI framework, for example, Button.

Bindings - User Interface Interoperability  

Coupling controls with data.

Let's look at an example where the TextBox control is used. Its task is to expose text on the screen, i.e. a stream of characters. The current value - what is on the screen - is provided via the Text property, which by design allows reading and writing string values. The equal sign after the Text property identifier must mean "transferring the current value to/from the selected place". We already know that the selected place must be a property of some object. The word Binding means that it is somehow attached to ActionText. Hence, the ActionText identifier is probably the name of a property defined in one of the custom types. Let's find this type using the Visual Studio context-menu navigation. As we can see, it works, and the property has the expected name.

As you can notice, the navigation works, so Visual Studio has no doubts about which type's instance this property comes from. If Visual Studio knows it, we should know it too. The answer to this question is in the three lines of the MainWindow DataContext definition in the XAML.

Let's start with the middle line that contains a full class name. The namespace has been replaced by the vm alias defined a few lines above. The class definition was opened as a result of the previous navigation to the property containing the text for the TextBox control. Let's consider what the class name means here. For the sake of simplicity, let's first look up the meaning of the DataContext identifier. It is the name of a property of the object type; since object is the base type for all types and DataContext is a property, we can read it or assign a new value to it. Having discarded all the absurd propositions, it is easy to guess that the MainViewModel identifier here refers to the parameter-less constructor of the MainViewModel type, and this entire fragment should be recognized as the equivalent of assigning a newly created instance of the MainViewModel type to the DataContext property. In other words, it is equivalent to the following statement
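A sketch of that equivalent statement in the code-behind (assuming the MainViewModel type from the example):

```csharp
// Sketch of the code-behind equivalent of the XAML DataContext definition:
// create a MainViewModel instance and assign it to the window's DataContext.
DataContext = new MainViewModel();
```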

Finally, at run-time, we can consider this object as a source and repository of process data used by the user interface. From a data point of view, it creates a kind of mirror of what is on the screen.

Let's go back to the previous example with the TextBox control and couple its Text property with the ActionText property of the class whose instance reference is assigned to the DataContext property. Here, the magic word Binding may be recognized as a virtual connection that transfers values between the interconnected properties. When asked how this happens and what the word Binding means, i.e. when asked about the semantics of this notation, I usually receive an answer like "It is some magic wand, which should be read as an internal implementation of WPF, and Binding is a keyword of the XAML language". This explanation would be sufficient, although it is a colloquial simplification. Unfortunately, we need to understand at least when this transfer is undertaken. The answer to this question is fundamental to understanding the requirements for the classes used to create the object whose reference is assigned to the DataContext property; the main goal is to keep the screen up to date. To find the answer, let's try to go to the definition of the Binding identifier using the context menu or the F12 key.

It turns out that Binding is the identifier of a class, or rather a constructor of this class. This must consequently mean that at this point the magic wand is creating an instance of the Binding class, which is responsible for transferring values from one property to another. The properties defined in the Binding type can be used to control how this transfer is performed. Since this object must operate on unknown types, reflection is used, which means this mechanism is rarely analyzed in detail. The colloquial explanation previously given, that the transfer is somehow carried out, is quite common because it has its advantages in the context of describing the effect.

The AttachedProperty class definition simulates this reflection-based action providing the functionality of assigning a value to the indicated property of an object whose type is unknown.
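A minimal sketch of such a reflection-based transfer (a hypothetical helper, not the actual WPF implementation; the property names in the usage comment are taken from the example):

```csharp
// Hypothetical sketch: transfer a value between properties of objects whose
// types are unknown at compile time, roughly what a Binding does internally.
static void TransferValue(object source, string sourceProperty,
                          object target, string targetProperty)
{
    object value = source.GetType().GetProperty(sourceProperty).GetValue(source);
    target.GetType().GetProperty(targetProperty).SetValue(target, value);
}

// e.g. TransferValue(viewModel, "ActionText", textBox, "Text");
```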

Using the properties defined in the Binding type, we can parameterize the transfer process and, for example, limit its direction. The operations described by the XAML text are performed once, at the beginning of the program, when the MainWindow instance is created. Therefore, we cannot specify here the point in time when this transfer should be carried out. To determine when an instance of the Binding type should trigger this transfer, let's look at the structure of the ActionText property in the MainViewWindow type. Here we see that the setter (used to update the current value) invokes two additional methods. In the context of the main problem, the RaisePropertyChanged method is invoked. This method raises the PropertyChanged event required to implement the INotifyPropertyChanged interface.

This event is used by objects of the Binding class to trigger the transfer of the current value. As a result of raising this event, we call the methods whose delegates have been added to the PropertyChanged event required by the mentioned interface. If the class does not implement this interface, or the PropertyChanged event is not raised, a new value assigned to a property will not be pulled and transferred to the bound property of a control. As a result, the screen will not be refreshed; it will remain static.
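The notification chain described above (setter → RaisePropertyChanged → PropertyChanged → value transfer) can be sketched independently of WPF. The following Python sketch mimics the pattern with hypothetical names (ViewModel, property_changed); it is an illustration of the idea, not the actual .NET types:

```python
class ViewModel:
    """Hypothetical view-model illustrating the INotifyPropertyChanged idea."""
    def __init__(self):
        self._action_text = ""
        self.property_changed = []   # subscriber callbacks (the "event")

    @property
    def action_text(self):
        return self._action_text

    @action_text.setter
    def action_text(self, value):
        self._action_text = value
        self.raise_property_changed("action_text")

    def raise_property_changed(self, name):
        # Without this call, subscribers never learn about the new value
        # and the "screen" stays static.
        for handler in self.property_changed:
            handler(self, name)

# The "binding" subscribes and pulls the new value when notified.
screen = {}
vm = ViewModel()
vm.property_changed.append(
    lambda sender, name: screen.update({name: getattr(sender, name)}))
vm.action_text = "updated"
```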

This is typical communication in which the MainViewWindow instance notifies about a change of the selected value and the MainWindow instance pulls the new value and displays it. In this kind of communication, MainViewWindow has the publisher role and MainWindow is the subscriber. It is worth stressing that communication is a run-time activity. It is initiated in the opposite direction compared with the compile-time type relationship. Hence, we can recognize it as inversion of control, or a callback communication method.

The analysis of the previous examples shows the mechanism that synchronizes the screen content with changes to the property values of classes dedicated to providing data for the GUI. Now we need to explain the sequence of operations carried out as a consequence of the user interface issuing a command, e.g. clicking an on-screen button (Button). We have an example here, and its Command property has been associated, as before, with something with the identifier ShowTreeViewMainWindowCommend. Using navigation in Visual Studio, we can go to the definition of this identifier and notice that it is again a property from the MainViewWindow class, but of the ICommand type. This time, the binding is not used to copy a property value but to convert a key click on the screen, e.g. using a mouse, into a call to the Execute operation, which is defined in the ICommand interface and must be implemented in the class used to create an object whose reference is assigned to this property.

For the sake of simplicity, the ICommand interface is implemented by a helper class called RelayCommand. In the constructor of this class, you should place a delegate to the method to be called as a result of the command execution. The second constructor is helpful in dynamically changing the state of a button on the screen. This can block future events, i.e. realize a state machine. And this is exactly the scenario implemented in the examined example program. Please note that the RaiseCanExecuteChanged method was omitted in the previous explanation.
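A minimal sketch of the RelayCommand idea, written in Python rather than C# and using hypothetical names; the two-delegate constructor mirrors the execute/can-execute pair described above:

```python
class RelayCommand:
    """Sketch of a relay command: wraps an execute delegate and an optional
    can-execute delegate that can enable or disable the command."""
    def __init__(self, execute, can_execute=None):
        self._execute = execute
        self._can_execute = can_execute

    def can_execute(self):
        return True if self._can_execute is None else self._can_execute()

    def execute(self):
        if self.can_execute():
            self._execute()

log = []
enabled = [False]
command = RelayCommand(lambda: log.append("clicked"), lambda: enabled[0])
command.execute()     # blocked: can_execute() returns False
enabled[0] = True
command.execute()     # now the delegate runs
```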

It may sound mysterious at first, but the fact that the graphical user interface is an element of the program is obvious to everyone. However, it is not so obvious that the GUI is not an integral part of the executing program process. Let's look at the diagram below, where we see the GUI as something external to the program, just like streaming and structured data. This interface can even be deployed on another physical machine. In such a case, the need for communication between machines must also be considered.

Program Bootstrap

As a result, we must look at the user interface and the running program as two independent entities operating in asynchronous environments. The problem, then, is how to synchronize the interface's content and behavior with the program flow.

In object-oriented programming, launching a program must cause the instantiation and initialization of a first object. Its constructor therefore contains the first instruction executed by the operating system process that serves as a platform for running the program. This raises the question of how to find this object.

Each project contains a configuration file. In the project, its content can be read using the context menu. There is a place where the Startup Object may be selected. There is only one entry to choose from, and its name syntax resembles a type name.

Since this is an automatically generated but custom type, it is worth asking how the development environment selects types for this list. Could there be more items on this list?

Since this is the Startup Object, the identifier in the drop-down list must be a class name. We find the App type in the Solution Explorer tree. After opening it, we see that it is XAML-compliant text. Notice that this file is coupled with a CSharp file. This is another example of a partial class written in two languages, so we expect XAML-to-CSharp conversion and text merging. In this definition of the App type, we can find a reference to another XAML file, namely an assignment to the StartupUri property pointing to the MainWindow.xaml file. It contains the definition of the graphical user interface, often called a shell.

It is worth paying attention to the fact that this class inherits from the Application class. The definition of this class is practically empty; it doesn't even have a constructor, which means that the default constructor is executed and does nothing. However, this allows you to define your own parameterless constructor. You can also override selected methods from the base class to adapt the behavior to the program's individual needs. Using the mentioned language constructs, we can locate the required auxiliary activities here, before implementing business logic. Typical examples are preparing the infrastructure related to program execution tracking, calling the Dispose operation for all objects that require it before the program ends, and creating additional objects related to business logic or preparing the infrastructure for dependency injection.

An image is a composition of colored pixels. They must be composed in such a way as to represent selected process information, i.e. its state and behavior. We do not create a graphical user interface (GUI for short) by laboriously assembling pixels into a coherent composition. Generating such graphics requires a formal description. A dedicated domain-specific language called Extensible Application Markup Language (XAML for short) is examined in this article. By design, it is used to formally describe what we see on the screen and the interoperability of the user interface. This language is based on XML, so the first questions are why XML and what the difference is between XAML and XML. XML-based documents must somehow be integrated with their CSharp counterparts; therefore, a consistent program is generated based on XML syntax, the semantics of XAML, and partial definitions. To make the user interface interoperable, the rendered controls are bound to the process data. Last but not least, this article addresses bootstrapping the application.

This article is part of a series dedicated to discussing selected issues related to representing process information in graphical form. If you are interested in this topic, you may want to check out the previous articles listed in the "See also" section. The main goal of referring to selected topics, GUI, MVVM, XAML, Binding, and Communication (to name only a few), is to improve understanding of the user interface's design process, making the development process faster, cheaper, and more portable. In other words, the mentioned technology is used only to ensure the engineering level of the discussion, because the discussed topics are independent of the technology used, and similar problems are encountered elsewhere.

  • XAML in WPF; Learn .NET, Windows Presentation Foundation; 06/02/2023
  • TreeView Overview; Learn .NET, Windows Presentation Foundation; 02/06/2023
  • Control Class; Learn .NET API browser, System.Windows.Controls
  • Programming in Practice - Graphical User Interface (GUI); M. Postol; C-SHARPCORNER; 2024
  • Programming in Practice - GUI - MVVM Program Design Pattern; M. Postol; C-SHARPCORNER; 2024

Introduction to Graph Data Structure

Graph Data Structure is a non-linear data structure consisting of vertices and edges. It is useful in fields such as social network analysis, recommendation systems, and computer networks. In the field of sports data science, graph data structure can be used to analyze and understand the dynamics of team performance and player interactions on the field.


Table of Content

  • What is Graph Data Structure?
  • Components of Graph Data Structure
  • Types of Graph Data Structure
  • Representation of Graph Data Structure
  • Adjacency Matrix Representation of Graph Data Structure
  • Adjacency List Representation of Graph
  • Basic Operations on Graph Data Structure
  • Difference between Tree and Graph
  • Real-Life Applications of Graph Data Structure
  • Advantages of Graph Data Structure
  • Disadvantages of Graph Data Structure
  • Frequently Asked Questions (FAQs) on Graph Data Structure

Graph is a non-linear data structure consisting of vertices and edges. The vertices are sometimes also referred to as nodes, and the edges are lines or arcs that connect any two nodes in the graph. More formally, a graph is composed of a set of vertices (V) and a set of edges (E), and is denoted by G(V, E).

Imagine a game of football as a web of connections, where players are the nodes and their interactions on the field are the edges. This web of connections is exactly what a graph data structure represents, and it’s the key to unlocking insights into team performance and player dynamics in sports.

  • Vertices: Vertices are the fundamental units of the graph. A vertex is also known as a node, and every vertex can be labeled or unlabeled.
  • Edges: Edges connect two nodes of the graph. In a directed graph, an edge is an ordered pair of nodes. Edges can connect any two nodes; there are no rules. Edges are also known as arcs, and every edge can be labeled or unlabeled.
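The two components above can be written down directly. A minimal sketch of a graph as a set of vertices and a set of edges (Python, with illustrative names only):

```python
# A graph G(V, E): a set of vertices and the edges that connect them.
V = {"A", "B", "C", "D"}
E = {("A", "B"), ("B", "C"), ("C", "A")}   # unordered pairs in an undirected graph

def degree(v):
    """Number of edges incident to vertex v."""
    return sum(1 for (a, b) in E if v in (a, b))
```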

Types Of Graph Data Structure:

1. Null Graph

A graph is known as a null graph if there are no edges in the graph.

2. Trivial Graph

A graph with only a single vertex and no edges is known as a trivial graph.

3. Undirected Graph

A graph in which edges do not have any direction; that is, the nodes are unordered pairs in the definition of every edge.

4. Directed Graph

A graph in which every edge has a direction; that is, the nodes are ordered pairs in the definition of every edge.


5. Connected Graph

A graph in which any node can be reached from any other node is known as a connected graph.

6. Disconnected Graph

A graph in which at least one node is not reachable from some other node is known as a disconnected graph.


7. Regular Graph

A graph in which the degree of every vertex is equal to K is called a K-regular graph.

8. Complete Graph

A graph in which every pair of distinct vertices is connected by an edge is known as a complete graph.

9. Cycle Graph

A graph that forms a single cycle, i.e. in which the degree of each vertex is 2, is known as a cycle graph.

10. Cyclic Graph

A graph containing at least one cycle is known as a Cyclic graph.


11. Directed Acyclic Graph

A Directed Graph that does not contain any cycle. 

12. Bipartite Graph

A graph in which the vertices can be divided into two sets such that no edge connects two vertices within the same set.


13. Weighted Graph

  • A graph in which each edge is assigned a suitable weight is known as a weighted graph.
  • Weighted graphs can be further classified into directed weighted graphs and undirected weighted graphs.

Representation of Graph Data Structure:

There are two ways to store a graph:

  • Adjacency Matrix
  • Adjacency List

Adjacency Matrix Representation of Graph Data Structure:

In this method, the graph is stored in the form of a 2D matrix where rows and columns denote vertices. Each entry in the matrix represents the weight of the edge between those vertices.


Below is the implementation of Graph Data Structure represented using Adjacency Matrix:
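The referenced implementation did not survive extraction; a minimal Python sketch of an undirected, weighted graph stored as an adjacency matrix (the GraphMatrix class name is illustrative) might look like this:

```python
class GraphMatrix:
    """Graph stored as an n x n matrix; entry [u][v] holds the edge weight
    (0 means no edge)."""
    def __init__(self, n):
        self.n = n
        self.matrix = [[0] * n for _ in range(n)]

    def add_edge(self, u, v, weight=1):
        # Undirected: record the weight in both directions.
        self.matrix[u][v] = weight
        self.matrix[v][u] = weight

    def remove_edge(self, u, v):
        self.matrix[u][v] = 0
        self.matrix[v][u] = 0

g = GraphMatrix(3)
g.add_edge(0, 1)
g.add_edge(1, 2, weight=5)
```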

Adjacency List Representation of Graph:

This graph is represented as a collection of linked lists. There is an array of lists, where the list at index i stores the vertices adjacent to vertex i.


Below is the implementation of Graph Data Structure represented using Adjacency List:
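The referenced implementation did not survive extraction; a minimal Python sketch of an undirected graph stored as an adjacency list (the GraphList class name is illustrative) might look like this:

```python
from collections import defaultdict

class GraphList:
    """Graph stored as a mapping from each vertex to its list of neighbours."""
    def __init__(self):
        self.adj = defaultdict(list)

    def add_edge(self, u, v):
        # Undirected: each endpoint records the other.
        self.adj[u].append(v)
        self.adj[v].append(u)

g = GraphList()
g.add_edge(0, 1)
g.add_edge(0, 2)
```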

Comparison between Adjacency Matrix and Adjacency List

When a graph contains a large number of edges (a dense graph), it is good to store it as a matrix because only a few entries in the matrix will be empty. Algorithms such as Prim's and Dijkstra's can use the adjacency matrix to achieve lower complexity on dense graphs.

Action           | Adjacency Matrix | Adjacency List
Adding an edge   | O(1)             | O(1)
Removing an edge | O(1)             | O(N)
Initializing     | O(N*N)           | O(N)

Basic Operations on Graph Data Structure:

Below are the basic operations on the graph:

  • Add and Remove vertex in Adjacency List representation of Graph
  • Add and Remove vertex in Adjacency Matrix representation of Graph
  • Add and Remove Edge in Adjacency List representation of a Graph
  • Add and Remove Edge in Adjacency Matrix representation of a Graph
  • Searching in Graph Data Structure: search for an entity in the graph.
  • Traversal of Graph Data Structure: traversing all the nodes in the graph.
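As an illustration of the traversal operation listed above, a breadth-first search over an adjacency-list graph might be sketched as follows (Python, illustrative names):

```python
from collections import deque

def bfs(adj, start):
    """Breadth-first traversal over an adjacency-list graph; returns the
    vertices in the order they are visited."""
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj.get(v, []):
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
bfs(adj, 0)   # visits the start vertex, then its neighbours, then theirs
```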

Difference between Tree and Graph:

A tree is a restricted type of graph, just with some more rules. Every tree is a graph, but not every graph is a tree. Linked lists, trees, and heaps are all special cases of graphs.


Real-Life Applications of Graph Data Structure:

Graph Data Structure has numerous real-life applications across various fields. Some of them are listed below:


  • Graph Data Structure is used to represent social networks, such as networks of friends on social media.
  • It can be used to represent the topology of computer networks, such as the connections between routers and switches.
  • It can be used to represent the connections between different places in a transportation network, such as roads and airports.
  • Neural Networks: Vertices represent neurons and edges represent the synapses between them. Neural networks are used to understand how our brain works and how connections change when we learn. The human brain has about 10^11 neurons and close to 10^15 synapses.
  • Compilers: Graph Data Structure is used extensively in compilers. They can be used for type inference, for so-called data flow analysis, register allocation, and many other purposes. They are also used in specialized compilers, such as query optimization in database languages.
  • Robot planning: Vertices represent states the robot can be in and the edges the possible transitions between the states. Such graph plans are used, for example, in planning paths for autonomous vehicles.

Advantages of Graph Data Structure:

  • Graph Data Structure can be used to represent a wide range of relationships and data structures.
  • They can be used to model and solve a wide range of problems, including pathfinding, data clustering, network analysis, and machine learning.
  • Graph algorithms are often very efficient and can be used to solve complex problems quickly and effectively.
  • Graph Data Structure can be used to represent complex data structures in a simple and intuitive way, making them easier to understand and analyze.

Disadvantages of Graph Data Structure:

  • Graph Data Structure can be complex and difficult to understand, especially for people who are not familiar with graph theory or related algorithms.
  • Creating and manipulating graphs can be computationally expensive, especially for very large or complex graphs.
  • Graph algorithms can be difficult to design and implement correctly, and can be prone to bugs and errors.
  • Graph Data Structure can be difficult to visualize and analyze, especially for very large or complex graphs, which can make it challenging to extract meaningful insights from the data.

Frequently Asked Questions (FAQs) on Graph Data Structure:

1. What is a Graph?

A graph is a data structure consisting of a set of vertices (nodes) and a set of edges that connect pairs of vertices.

2. What are the different types of Graph Data Structure?

Graph Data Structure can be classified into various types based on properties such as directionality of edges (directed or undirected), presence of cycles (acyclic or cyclic), and whether multiple edges between the same pair of vertices are allowed (simple or multigraph).

3. What are the applications of Graph Data Structure?

Graph Data Structure has numerous applications in various fields, including social networks, transportation networks, computer networks, recommendation systems, biology, chemistry, and more.

4. What is the difference between a directed graph and an undirected graph?

In an undirected graph, edges have no direction, meaning they represent symmetric relationships between vertices. In a directed graph (or digraph), edges have a direction, indicating a one-way relationship between vertices.

5. What is a weighted graph?

A weighted graph is a graph in which each edge is assigned a numerical weight or cost. These weights can represent distances, costs, or any other quantitative measure associated with the edges.

6. What is the degree of a vertex in a graph?

The degree of a vertex in a graph is the number of edges incident to that vertex. In a directed graph, the indegree of a vertex is the number of incoming edges, and the outdegree is the number of outgoing edges.

7. What is a path in a graph?

A path in a graph is a sequence of vertices connected by edges. The length of a path is the number of edges it contains.

8. What is a cycle in a graph?

A cycle in a graph is a path that starts and ends at the same vertex, traversing a sequence of distinct vertices and edges in between.

9. What are spanning trees and minimum spanning trees?

A spanning tree of a graph is a subgraph that is a tree and includes all the vertices of the original graph. A minimum spanning tree (MST) is a spanning tree with the minimum possible sum of edge weights.

10. What algorithms are commonly used to traverse or search Graph Data Structure?

Common graph traversal algorithms include depth-first search (DFS) and breadth-first search (BFS). These algorithms are used to explore or visit all vertices in a graph, typically starting from a specified vertex. Other algorithms, such as Dijkstra’s algorithm and Bellman-Ford algorithm, are used for shortest path finding.
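As an illustration of the shortest-path algorithms mentioned above, a minimal sketch of Dijkstra's algorithm over a weighted adjacency list (Python, illustrative names; this is one standard heap-based formulation, not the only one):

```python
import heapq

def dijkstra(adj, source):
    """Shortest distances from source in a graph given as
    vertex -> list of (neighbour, edge_weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue   # stale entry; a shorter path was already found
        for w, weight in adj.get(v, []):
            nd = d + weight
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return dist

adj = {"A": [("B", 1), ("C", 4)], "B": [("C", 2)], "C": []}
dijkstra(adj, "A")   # {"A": 0, "B": 1, "C": 3}
```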

More Resources of Graph:

  • Recent Articles on Graph
  • Practice problems on Graph
  • Algorithms on Graphs



Title: Bounding Boxes and Probabilistic Graphical Models: Video Anomaly Detection Simplified

Abstract: In this study, we formulate the task of Video Anomaly Detection as a probabilistic analysis of object bounding boxes. We hypothesize that representing objects via their bounding boxes alone can be sufficient to successfully identify anomalous events in a scene. The implied value of this approach is increased object anonymization, faster model training, and fewer computational resources. This can particularly benefit applications within video surveillance running on edge devices such as cameras. We design our model based on human reasoning, which lends itself to explaining model output in human-understandable terms. Meanwhile, the slowest model trains in less than 7 seconds on an 11th Generation Intel Core i9 Processor. While our approach constitutes a drastic reduction of the problem feature space in comparison with prior art, we show that this does not result in a reduction in performance: the results we report are highly competitive on the benchmark datasets CUHK Avenue and ShanghaiTech, and significantly exceed the latest state-of-the-art results on StreetScene, which has so far proven to be the most challenging VAD dataset.
Subjects: Computer Vision and Pattern Recognition (cs.CV)


Enhancement of patient's health prediction system in a graphical representation using digital twin technology

  • Published: 10 July 2024


  • M. Sobhana (ORCID: orcid.org/0009-0007-9915-725X),
  • Smitha Chowdary Ch,
  • Sowmya Koneru,
  • G. Krishna Mohan &
  • K. Kranthi Kumar

The patient health prediction system is among the most critical topics in medical research. Several prediction models exist to predict a patient's health condition. However, relevant results were not attained because of poor data quality. IoT-sensed data contains significant noise, which increases the complexity of health prediction. These demerits resulted in low prediction and performance scores. Therefore, the proposed work aims to develop a novel Coati-based Recurrent Digital Twin Framework (CbRDTF) to predict patients' health conditions. The novelty of this research lies in combining coati optimization and a recurrent network with a digital twin for health prediction. Initially, the IoT-sensed data was imported, the data were preprocessed, and meaningful features were selected. Then the health condition of the patients was predicted and the conditions were classified. Here, the coati function incorporated at the classification layer extracted the relevant features from the sensed medical data for robust prediction and also modified the parameters of the recurrent digital twin to improve the prediction and classification accuracy. Finally, the performance was measured; the presented model attained a high exactness score of 99.81% in prediction accuracy, recall, f-value, and precision, and a computation time of 611.81 s.


Data availability

Data sharing not applicable to this article as no datasets were generated or analysed during the current study.


Jackins V, Vimal S, Kaliappan M, Lee MY (2021) AI-based smart prediction of clinical disease using random forest classifier and Naive Bayes. J Supercomput 77:5198–5219. https://doi.org/10.1007/s11227-020-03481-x

Lakshmanaprabu SK, Mohanty SN, Krishnamoorthy S, Uthayakumar J, Shankar K (2019) Online clinical decision support system using optimal deep neural networks. Appl Soft Comput 81:105487. https://doi.org/10.1016/j.asoc.2019.105487

Huifeng W, Kadry SN, Raj ED (2020) Continuous health monitoring of sportspersons using IoT devices-based wearable technology. Comput Commun 160:588–595. https://doi.org/10.1016/j.comcom.2020.04.025

Morid MA, Sheng OR, Kawamoto K, Abdelrahman S (2020) Learning hidden patterns from patient multivariate time series data using convolutional neural networks: A case study of healthcare cost prediction. J Biomed Inform 111:103565. https://doi.org/10.1016/j.jbi.2020.103565

Khan MA (2020) An IoT framework for heart disease prediction based on MDCNN classifier. IEEE Access 8:34717–34727. https://doi.org/10.1109/ACCESS.2020.2974687

Elayan H, Aloqaily M, Guizani M (2021) Digital twin for intelligent context-aware IoT healthcare systems. IEEE Internet Things J 8(23):16749–16757. https://doi.org/10.1109/JIOT.2021.3051158

Abilkaiyrkyzy A, Laamarti F, Hamdi M, El Saddik A (2024) Dialogue System for Early Mental Illness Detection: Towards a Digital Twin Solution. IEEE Access 12:2007–2024. https://doi.org/10.1109/ACCESS.2023.3348783

Haleem A, Javaid M, Singh RP, Suman R (2023) Exploring the revolution in healthcare systems through the applications of digital twin technology. Biomed Technol 4:28–38. https://doi.org/10.1016/j.bmt.2023.02.001

Chakshu NK, Sazonov I, Nithiarasu P (2021) Towards enabling a cardiovascular digital twin for human systemic circulation using inverse analysis. Biomech Model Mechanobiol 20(2):449–465. https://doi.org/10.1007/s10237-020-01393-6

Chen J, Wang W, Fang B, Liu Y, Yu K, Leung VCM, Hu X (2023) Digital twin empowered wireless healthcare monitoring for smart home. IEEE J Sel Areas Commun 41(11):3662–3676. https://doi.org/10.1109/JSAC.2023.3310097

Roccetti M (2023) Predictive health intelligence: Potential, limitations and sense making. Math Biosci Eng 20(6):10459–10463. https://doi.org/10.3934/mbe.2023460

Alinsaif S (2024) Unraveling Arrhythmias with Graph-Based Analysis: A Survey of the MIT-BIH Database. Computation 12(2):21. https://doi.org/10.3390/computation12020021

Download references

Acknowledgements

Authors and Affiliations

Department of Computer Science and Engineering, V.R. Siddhartha Engineering College, Vijayawada, 520007, Andhra Pradesh, India

Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, 522302, Andhra Pradesh, India

Smitha Chowdary Ch

Department of Computer Science and Engineering, Dhanekula Institute of Engineering & Technology, Vijayawada, 521139, Andhra Pradesh, India

Sowmya Koneru

Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh, India

G. Krishna Mohan

Department of IT, Vasireddy Venkatadri Institute of Technology, Guntur, Andhra Pradesh, 522508, India

K. Kranthi Kumar


Corresponding author

Correspondence to M. Sobhana.

Ethics declarations

Ethical approval.

All applicable institutional and/or national guidelines for the care and use of animals were followed.

Informed consent

For this type of analysis formal consent is not needed.

Conflict of interest

The authors declare that they have no potential conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Sobhana, M., Ch, S.C., Koneru, S. et al. Enhancement of patient's health prediction system in a graphical representation using digital twin technology. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19759-8


Received: 03 July 2023

Revised: 08 April 2024

Accepted: 23 June 2024

Published: 10 July 2024

DOI: https://doi.org/10.1007/s11042-024-19759-8


  • Digital Twin Framework
  • Internet of Things
  • Coati Optimization
  • Artificial Intelligence
  • Deep Neural Network

  • Open access
  • Published: 05 July 2024

PND-Net: plant nutrition deficiency and disease classification using graph convolutional network

  • Asish Bera,
  • Debotosh Bhattacharjee &
  • Ondrej Krejcar

Scientific Reports volume 14, Article number: 15537 (2024)


  • Biomedical engineering
  • Health care
  • Medical imaging

Crop yield production could be enhanced for agricultural growth if various plant nutrition deficiencies and diseases are identified and detected at early stages. Hence, continuous health monitoring of plants is crucial for handling plant stress. Deep learning methods have proven their superior performance in the automated detection of plant diseases and nutrition deficiencies from visual symptoms in leaves. This article proposes a new deep learning method for plant nutrition deficiency and disease classification using a graph convolutional network (GCN), added upon a base convolutional neural network (CNN). Sometimes, a global feature descriptor might fail to capture the vital region of a diseased leaf, which causes inaccurate classification of the disease. To address this issue, regional feature learning is crucial for a holistic feature aggregation. In this work, region-based feature summarization at multiple scales is explored using spatial pyramidal pooling for a discriminative feature representation. Furthermore, a GCN is developed to enable the learning of finer details for classifying plant diseases and insufficiency of nutrients. The proposed method, called Plant Nutrition Deficiency and Disease Network (PND-Net), has been evaluated on two public datasets for nutrition deficiency and two for disease classification using four backbone CNNs. The best classification performances of the proposed PND-Net are as follows: (a) 90.00% on Banana and 90.54% on Coffee nutrition deficiency; and (b) 96.18% on Potato diseases and 84.30% on PlantDoc datasets using the Xception backbone.
Furthermore, additional experiments have been carried out for generalization, and the proposed method has achieved state-of-the-art performances on two public datasets, namely the Breast Cancer Histopathology Image Classification (BreakHis 40×: 95.50% and BreakHis 100×: 96.79% accuracy) and Single cells in Pap smear images for cervical cancer classification (SIPaKMeD: 99.18% accuracy). Also, the proposed method has been evaluated using five-fold cross-validation and achieved improved performances on these datasets. Clearly, the proposed PND-Net effectively boosts the performance of automated health analysis of various plants in real and intricate field environments, implying PND-Net's aptness for agricultural growth as well as human cancer classification.


Introduction

Agricultural production plays a crucial role in the sustainable economic and societal growth of a country. High-quality crop yield production is essential for satisfying global food demands and better health. However, several key factors, such as environmental barriers, pollution, and climate change, adversely affect crop yield and quality. Moreover, poor soil-nutrition management causes severe plant stress, leading to different diseases and resulting in substantial financial loss. Thus, plant nutrition diagnosis and disease detection at an early stage are of utmost importance for the overall health monitoring of plants 1 . Nutrition management in agriculture is a decisive task for maintaining the growth of plants. In recent times, the success of machine learning (ML) techniques has been witnessed in developing decision support systems over traditional manual supervision of agricultural yield. Moreover, nutrient management is critical for improving production growth, focusing on a robust and low-cost solution. Intelligent automated systems based on ML effectively build more accurate predictive models, which are relevant for improving agricultural production.

Nutrient deficiency in plants exhibits certain visual symptoms and may cause poor crop yields 2 . Diagnosis of plant nutrient inadequacy using deep learning and related intelligent methods is an emerging area in precision agriculture and plant pathology 3 . Automated detection and classification of nutrient deficiencies using computer vision and artificial intelligence have been studied in the recent literature 4 , 5 , 6 , 7 , 8 . Diagnosis of nutrient deficiencies in various plants (e.g., rice, banana, guava, palm oil, apple, lettuce, etc.) is vital because soil ingredients often cannot provide the nutrients required for the growth of plants 9 , 10 , 11 , 12 . Also, early-stage detection of leaf diseases (e.g., potato, rice, cucumber, etc.) and pests is essential to monitor crop yield production 13 . A few approaches to disease detection and nutrient deficiencies in rice leaves have been developed and studied in recent times 14 , 15 , 16 , 17 , 18 . Hence, monitoring plant health, disease, and nutrition inadequacy could be a challenging image classification problem in artificial intelligence (AI) and machine learning (ML) 19 .

This paper proposes a deep learning method for plant health diagnosis by integrating a graph convolutional network (GCN) upon a backbone deep convolutional neural network (CNN). The complementary discriminatory features of different local regions of input leaf images are aggregated into a holistic representation for plant nutrition and disease classification. GCNs were originally developed for semi-supervised node classification 20 . Over time, several variations of GCNs have been developed for graph-structured data 21 . Furthermore, the GCN is effective for message propagation over image and video data in various applications. In this direction, several works have been developed for image recognition using GCNs 22 , 23 . However, little research attention has been given to adopting GCNs for plant disease prediction and nutrition monitoring 24 . Thus, in this work, we have studied the effectiveness of the GCN in solving the current problem of plant health analysis regarding nutrition deficiency and disease classification of several categories of plants.

The proposed method, called Plant Nutrition Deficiency and Disease Network (PND-Net), attempts to establish a correlation between different regions of the leaves for identifying infected and defective regions at multiple granularities. To this end, region pooling in local contexts and spatial pooling in a pyramidal structure have been explored for a holistic feature representation of subtle discrimination of plant health conditions. Other existing approaches have built the graph-based correlation directly upon the CNN features, but they have often failed to capture finer descriptions of the input data. In this work, we have integrated two different feature pooling techniques for generating the node features of the graph. As a result, this mixing enables an enhanced feature representation, which is further improved by graph layer activations in the hidden layers of the GCN. The effectiveness of the proposed strategy has been analysed with rigorous experiments on two plant nutrition deficiency and two plant disease classification datasets. In addition, the method has been tested on two different human cancer classification tasks to assess its generalization. The key contributions of this work are:

A deep learning method, called PND-Net, is devised by integrating a graph convolutional module upon a base CNN to enhance the feature representation for improving the classification performances of unhealthy leaves.

A combination of fixed-size region-based pooling with multi-scale spatial pyramid pooling progressively enhances the feature aggregation for building a spatial relation between the regions via the neighborhood nodes of a spatial graph structure.

Experimental studies have been carried out for validating the proposed method on four public datasets, which have been tested for plant disease classification and nutrition deficiency classification. For generalization of the proposed method, a few experiments have been conducted on the cervical cancer cell (SIPaKMeD) and breast cancer histopathology image (BreakHis 40× and 100×) datasets. The proposed PND-Net has achieved state-of-the-art performances on these six public datasets of different categories.

The rest of this paper is organized as follows: “ Related works ” summarizes related works. “ Proposed method ” describes the proposed methodology. The experimental results are showcased in “ Results and performance analysis ”, followed by the conclusion in “ Conclusion ”.

Related works

Several works have contributed to plant disease detection, most of which were tested on controlled datasets acquired in a laboratory set-up. Only a few works have developed unconstrained datasets considering realistic field conditions, which have been studied in this work. A concise review of recent works is presented here.

Methods on plant nutrition deficiencies

Bananas are one of the most widely consumed staple foods across the world. An image dataset depicting the visual deficiency symptoms of eight essential nutrients, namely boron, calcium, iron, potassium, manganese, magnesium, sulphur, and zinc, has been developed 25 . This dataset has been tested in the proposed work. The CoLeaf dataset contains images of coffee plant leaves and has been tested for nutritional deficiency recognition and classification 26 . The nutritional status of oil palm leaves, particularly the status of chlorophyll and macro-nutrients (e.g., N, K, Ca, and Mg) in the leaves from proximal multi-spectral images, has been evaluated using machine learning techniques 27 . The identification and categorization of common macro-nutrient (e.g., nitrogen, phosphorus, potassium, etc.) deficiencies in rice plants has been addressed 17 , 28 . The percentage of micro-nutrient deficiencies in rice plants has been estimated using CNNs and Random Forest (RF) 28 . Detection of biotic-stressed rice leaves and abiotic-stressed leaves caused by NPK (nitrogen, phosphorus, and potassium) deficiencies has been experimented with using CNNs 29 .

A supervised monitoring system of tomato leaves has used a CNN to recognize and classify the type of nutrient deficiency in tomato plants and achieved 86.57% accuracy 30 . Nutrient deficiency symptoms have been recognized in RGB images by using CNN-based (e.g., EfficientNet) transfer learning on orange with 98.52% accuracy and sugar beet with 98.65% accuracy 31 . Nutrient deficiency detection in rice plants has reported 97.0% accuracy by combining a CNN and reinforcement learning 32 . The R-CNN object detector has achieved an accuracy of 82.61% for identifying nutrient deficiencies in chili leaves 33 . Feature aggregation schemes combining HSV and RGB features for color, GLCM and LBP for texture, and Hu moments and centroid distance for shape have been examined for nutrient deficiency identification in chili plants 34 . However, this method performed best using a CNN with 97.76% accuracy. An ensemble of CNNs has reported 98.46% accuracy for classifying groundnut plant leaf images 35 . An intelligent robotic system with wireless control to monitor the nutrition essentials of spinach plants in a greenhouse has been evaluated with 86% precision 36 . The nutrient status and health conditions of Romaine lettuce plants in a hydroponic setup have been tested using a CNN with 90% accuracy 37 . The identification and categorization of common macro-nutrient deficiencies in rice plants using pixel-ratio analysis in the HSV color space has been evaluated with more than 90% accuracy 17 . A method for estimating leaf nutrient concentrations of citrus trees using unmanned aerial vehicle (UAV) multi-spectral images has been developed and tested by a gradient-boosting regression tree model with moderate precision 38 .

Approaches on plant diseases

The classification of healthy and diseased citrus leaf images using a CNN on a Platform as a Service (PaaS) cloud has been developed. The method has been tested using pre-trained backbones and a proposed CNN, and attained 98.0% accuracy and a 99.0% F1-score 39 . A modified transfer learning (TL) method using three pre-trained CNNs has been tested for potato leaf disease detection, and DenseNet169 has achieved 99.0% accuracy 40 . Likewise, a CNN-based transfer learning method has been adapted for detecting powdery mildew disease with 98.0% accuracy in bell pepper leaves 41 , and woody fruit leaves with 85.90% accuracy 42 . A two-stage transfer learning method has combined Faster R-CNN for leaf detection and a CNN for maize plant disease recognition in a natural environment and obtained a 99.70% F1-score 43 . A hybrid model integrating a CNN and random forest (RF) for multi-classifying rice hispa disease into distinct intensity levels has attained an accuracy of 97.46% 44 . An improved YOLOv5 network has been developed for cucumber leaf disease and pest detection and reported 73.8% precision 13 . A fusion of the VGG16 and AlexNet architectures has attained 95.82% testing accuracy for pepper leaf disease classification 45 . Likewise, disease classification of black pepper has gained 99.67% accuracy using ResNet-18 46 . A ConvNeXt with an attention module, namely CBAM-ConvNeXt, has improved the performance with 85.42% accuracy for classifying soybean leaf disease 47 . A channel extension residual structure with an adaptive channel attention mechanism and a bidirectional information fusion block has been proposed for leaf disease classification 48 . This technique has achieved 99.82% accuracy on the PlantVillage dataset. A smartphone application has been developed for detecting habanero plant disease and obtained 98.79% accuracy 49 .
In addition, an ensemble method for a crop monitoring system to identify plant diseases at early stages using an IoT-enabled system has been presented with a best precision of 84.6% 50 . A dataset comprising five types of disorders of apple orchards has been developed, and the best accuracy, 97.3%, has been achieved using a CNN 51 . A lightweight model using the ViT structure has been developed for rice leaf disease classification and attained a 91.70% F1-score 52 .

Methods on graph convolutional networks (GCN)

Though several deep learning approaches have been developed for plant health analysis, little progress has been achieved using GCNs for visual recognition of plant diseases 53 . The SR-GNN integrates relation-aware feature representation, leveraging context-aware attention with the GCN module 22 . Cervical cell classification methods have been developed by exploring the potential correlations of clusters through a GCN 54 and feature rank analysis 55 . On the other hand, fusions of multiple CNNs, transfer learning, and other deep learning methods have been developed for early detection of breast cancer 56 . This fusion method has achieved an F1-score of 99.0% on an ultrasound breast cancer dataset using VGG-16. In this work, a GCN-based method has been developed by capturing the regional importance of local contextual features in solving plant disease recognition and human cancer image classification challenges.

Proposed method

The proposed method, called PND-Net, combines deep features using a CNN and a GCN in an end-to-end pipeline, as shown in Fig. 1 . Firstly, a backbone CNN computes high-level deep features from input images. Then, a GCN is included upon the CNN for refining the deep features using region-based pooling and pyramid pooling strategies, capturing finer details of contextual regions at multiple scales. Finally, a precise feature map is built for improving the performance.

figure 1

Proposed GCN-based method, PND-Net for visual classification of plant disease and nutrition inadequacy.

Background of graph convolutional network (GCN)

GCNs have been widely used across several domains and applications, such as node classification, edge attribute modeling, citation networks, knowledge graphs, and several other tasks, through graph-based representation. A GCN could be formulated by stacking multiple graph convolutional layers with non-linearity upon traditional convolutional layers, i.e., a CNN. In practice, this kind of stacking of GCN layers at a deeper level of a network enhances the model's learning capacity. Moreover, graph convolutional layers are effective for alleviating the overfitting issue and can address the vanishing gradient problem by adopting the normalization trick, which is a foundation of modeling a GCN. A widely used multi-layer GCN algorithm was proposed by Kipf and Welling 20 , which has been adopted here. It explores an efficient and fast layer-wise propagation method relying on the first-order approximation of spectral convolutions on graph structures. It is scalable and apposite for semi-supervised node classification from graph-based data. A linear formulation of a GCN could be simplified such that a convolution with filter \(g_{\theta }\) and parameters \(\theta \) is optimized at each layer, and the parameters can further be reduced to a single parameter. Here, the simplified graph convolution has been concisely defined 20 .

The graph Laplacian ( \(\Psi \) ) could further be normalized to mitigate the vanishing gradients within a network:

\(\Psi = \tilde{\textbf{D}}^{-1/2}\,\tilde{\textbf{A}}\,\tilde{\textbf{D}}^{-1/2}\)

where the binary adjacency matrix \(\tilde{\textbf{A}}=\textbf{A}+\textbf{I}_{{P}}\) denotes \(\textbf{A}\) with self-connections, \(\textbf{I}_{{P}}\) is the identity matrix, the degree matrix is \(\tilde{\textbf{D}}_{ii}= \sum _{j} \tilde{\textbf{A}}_{ij}\) , and X is an input data/signal to the graph. The simplified convoluted signal matrix \(\Omega \) is given as

\(\Omega = \tilde{\textbf{D}}^{-1/2}\,\tilde{\textbf{A}}\,\tilde{\textbf{D}}^{-1/2}\, X \Theta \qquad (3)\)

where the input features \(X \in \mathbb {R}^{{P}\times {C}}\) , the filter parameters \(\Theta \in \mathbb {R}^{{C}\times {F}}\) , and \(\Omega \in \mathbb {R}^{{P}\times {F}}\) is the convoluted signal matrix. Here, P is the number of nodes, C is the number of input channels, and F is the number of filters/feature maps. Now, this form of graph convolution (Eq. 3 ) is applied to address the current problem and is described in " Graph convolutional network (GCN) ".
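The normalization trick and the simplified convolution described above can be sketched in a few lines of NumPy. This is a minimal illustration on a hypothetical 3-node toy graph, not the authors' implementation; all variable names and sizes are illustrative:

```python
import numpy as np

def normalized_adjacency(A):
    """Compute D~^{-1/2} (A + I_P) D~^{-1/2}: the normalized adjacency with self-connections."""
    A_tilde = A + np.eye(A.shape[0])            # A~ = A + I_P (add self-connections)
    d = A_tilde.sum(axis=1)                     # degrees D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D~^{-1/2}
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def graph_convolution(A, X, Theta):
    """Omega = A_hat X Theta: (P, C) input signal -> (P, F) convoluted signal (Eq. 3)."""
    return normalized_adjacency(A) @ X @ Theta

# hypothetical toy graph: P = 3 nodes, C = 4 input channels, F = 2 filters
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.random.randn(3, 4)       # node features
Theta = np.random.randn(4, 2)   # filter parameters
Omega = graph_convolution(A, X, Theta)
print(Omega.shape)  # (3, 2)
```

The symmetric normalization keeps the scale of propagated signals bounded across layers, which is exactly the mechanism the text credits with mitigating vanishing gradients.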

Convolutional feature representation

A standard backbone CNN is used for deep feature extraction from an input leaf image. An input image with class label, denoted \(I_l\) \(\in \) \(\mathbb {R}^{h\times w\times 3}\) , is passed through a base CNN for extracting the feature map, denoted as \(\textbf{F}\) \(\in \) \(\mathbb {R}^{h\times w\times C}\) , where h , w , and C imply the height, width, and channels, respectively. However, the squeezed high-level feature map is not suitable for describing local non-overlapping regions. Hence, the output base feature map is spatially up-sampled to \(\textbf{F}\) \(\in \) \(\mathbb {R}^{H\times W\times C}\) and \(\omega \) distinct small regions are computed, given as \(\textbf{F}\) \(\in \) \(\mathbb {R}^{\omega \times h\times w\times C}\) . These regions represent complementary information at different spatial contexts. However, due to the fixed dimensions of the regions, the importance of each region is uniformly distributed, which could be tuned further for extracting more distinguishable information. A simple pooling technique could further be applied at multiple scales for enhancing the spatial feature representation. To this end, the region-pooled feature vectors are reshaped to convert them into an aggregated spatial feature space upon which multi-scale pyramidal pooling is possible. In addition, this kind of feature representation captures the overall spatiality to understand the informative features holistically and solve the current problem.
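The up-sampling and region-splitting step above can be expressed as a pure reshape. The following NumPy sketch is only illustrative: the map sizes, the nearest-neighbour up-sampling, and the choice of \(\omega = 16\) regions are assumptions, not the paper's actual configuration:

```python
import numpy as np

def split_into_regions(F, rh, rw):
    """Split an (H, W, C) feature map into omega fixed-size (rh, rw, C) regions."""
    H, W, C = F.shape
    assert H % rh == 0 and W % rw == 0
    return (F.reshape(H // rh, rh, W // rw, rw, C)
              .transpose(0, 2, 1, 3, 4)      # group by (row-block, col-block)
              .reshape(-1, rh, rw, C))       # (omega, rh, rw, C)

# hypothetical sizes: a 14x14x8 base map, up-sampled 2x (nearest neighbour),
# then split into omega = 16 non-overlapping 7x7 regions
F = np.random.randn(14, 14, 8)
F_up = F.repeat(2, axis=0).repeat(2, axis=1)   # (28, 28, 8)
regions = split_into_regions(F_up, 7, 7)
print(regions.shape)  # (16, 7, 7, 8)
```

Each of the \(\omega\) slices is one local context; stacking them yields the \(\omega \times h \times w \times C\) tensor that the pyramid pooling stage consumes.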

Spatial pyramid pooling (SPP)

The SPP layer was originally introduced to alleviate the fixed-length input constraints of conventional deep networks, which effectively boosted model performance 57 . Generally, an SPP layer is added upon the last convolutional layer of a backbone CNN. This pooling layer generates a fixed-length feature vector and afterward passes the feature map to a fully connected or classification layer. The SPP enhances the feature aggregation capability at a deeper layer of a network. Most importantly, SPP applies multi-level spatial bins for pooling while preserving the spatial relevance of the feature map. It provides a robust solution through performance enhancement of diverse computer vision problems, including plant/leaf image recognition.

A typical region pooling technique loses its spatial information while passing through a global average pooling (GAP) layer to make it compatible with the GCN. As a result, region pooling with a GAP layer aggressively eliminates the informativeness of regions and their correlations, and thus it often fails to build an effective feature vector. Also, the inter-region interactions are ignored when a GAP layer is applied upon region-based pooling only. Therefore, it is essential to correlate the inter-region interactions for selecting essential features, which could further be enriched and propagated through the GCN layer activations.

Our objective is to utilize the advantage of multi-level pooling at different pyramid levels of \(n \times n\) bins on top of fixed-size regions of the input image. As a result, the spatial relationships between different image regions are preserved, thereby escalating the learning capacity of the proposed PND-Net. The input feature space prior to pyramid pooling is given as \(\textbf{F}^{\omega \times (HW)\times C}\) , which has been derived from \(\textbf{F}^{\omega \times H\times W\times C}\) . It enables the selection of contextual features of neighboring regions (i.e., inter-regions) through pyramid pooling simultaneously. This little adjustment in the spatial dimension of the input features prior to pooling captures the interactions between the local regions of the input leaf image. Experimental results reflect that pyramidal pooling indeed yields an image classification accuracy gain over region pooling alone.

where \(\delta _i\) and \(\delta _j\) define the window sizes, which enable pooling a total of \(P=(i\times i) + (j\times j)\) feature maps after SPP, given as \(\textbf{F}^{P\times C}\) . These feature maps are further fed into a GCN module, described next. The key components of the proposed method are pictorially ideated in Fig. 1 .
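The pyramid pooling above can be sketched with adaptive average pooling over evenly spaced bins. The pyramid levels 2 and 3 below are assumptions chosen only to illustrate \(P = 2\times 2 + 3\times 3 = 13\) node features; the paper's actual levels may differ:

```python
import numpy as np

def spatial_pyramid_pool(F, levels=(2, 3)):
    """Average-pool an (H, W, C) map over n x n adaptive bins per pyramid level.
    Returns (P, C) node features with P = sum(n * n for n in levels)."""
    H, W, C = F.shape
    nodes = []
    for n in levels:
        # adaptive bin boundaries: an n x n grid covering the map evenly
        hs = np.linspace(0, H, n + 1).astype(int)
        ws = np.linspace(0, W, n + 1).astype(int)
        for a in range(n):
            for b in range(n):
                patch = F[hs[a]:hs[a + 1], ws[b]:ws[b + 1], :]
                nodes.append(patch.mean(axis=(0, 1)))  # one C-dim node per bin
    return np.stack(nodes)  # (P, C)

F = np.random.randn(28, 28, 8)
nodes = spatial_pyramid_pool(F, levels=(2, 3))
print(nodes.shape)  # (13, 8)  ->  P = 2*2 + 3*3 = 13
```

The resulting \(P \times C\) matrix is exactly the \(F_{SPP}\) form that the GCN module takes as its node-feature input.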

Graph convolutional network (GCN)

A graph \(G=({P},E)\) , with P nodes and E edges, is constructed for feature propagation. A GCN is applied for building a spatial relation between the features through the graph G . The nodes are characterized by deep feature maps, with C convoluted features per node at the output. The edges E are described by an undirected adjacency matrix \(\textbf{A} \in \mathbb {R}^{{P}\times {P}}\) for representing node-level interactions. This graph convolution has been applied to \(F_{SPP}\) (i.e., \(\textbf{F}^{P\times C}\) ), described above. The layer-wise feature propagation rule is defined as:

\(\textbf{G}^{(l+1)} = \sigma \left( \hat{\tilde{\textbf{A}}}\, \textbf{G}^{(l)}\, \textbf{W}^{(l)} \right), \quad \textbf{G}^{(0)} = F_{SPP}\)

where \(l=0, 1, \dots , L-1\) indexes the layers and \(\textbf{W}^{(l)}\) is a weight matrix for the l -th layer. A non-linear activation function ( e.g. , ReLU) is denoted by \(\sigma (.)\) . The symmetrically normalized adjacency matrix is \(\hat{\tilde{\textbf{A}}}=Q\tilde{\textbf{A}}Q\) , where \(Q=\tilde{\textbf{D}}^{-1/2}\) is computed from the diagonal node degree matrix of \(\tilde{\textbf{A}}\) (defined in Eq. 3 ). Next, the reshaped convolutional feature map \({\textbf {F}}\) is fed into two layers of graph convolutions, which are capable of capturing local neighborhoods via the non-linear activations of the rectified linear unit (ReLU) in the graph convolutional layers. The dimension of the output feature maps remains the same as the input of the GCN layers, i.e., \( \textbf{G}^{(L)}\rightarrow {\textbf {F}}_{G}\) \(\in \mathbb {R}^{{P}\times {C}}\) . However, the node features could be squeezed to a lower dimension, which may lose essential information pertinent to spatial modeling. Hence, the channel dimension is kept uniform within the network pipeline in our study. Afterward, the graph-based transformed feature maps ( \({\textbf {F}}_{G}\) ) are pooled using a GAP for selecting the most discriminative channel-wise feature maps of the nodes.
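The two-layer propagation with uniform channel width and a final GAP over the P nodes can be sketched as a toy forward pass. The identity placeholder for \(\hat{\tilde{\textbf{A}}}\), the random weights, and the sizes P = 13, C = 8 are all assumptions for illustration only:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_forward(A_hat, F, W1, W2):
    """Two graph-convolution layers G^{l+1} = ReLU(A_hat G^l W^l),
    keeping the channel dimension C uniform, then GAP over the P nodes."""
    G1 = relu(A_hat @ F @ W1)      # (P, C) after layer 1
    G2 = relu(A_hat @ G1 @ W2)     # (P, C) -> F_G after layer 2
    return G2.mean(axis=0)         # global average pooling over nodes -> (C,)

P, C = 13, 8                       # e.g. P node features from the SPP stage
A_hat = np.eye(P)                  # placeholder for the precomputed normalized adjacency
F_spp = np.random.randn(P, C)
W1 = np.random.randn(C, C)
W2 = np.random.randn(C, C)
f = gcn_forward(A_hat, F_spp, W1, W2)
print(f.shape)  # (8,)
```

Note that both weight matrices are C-by-C, reflecting the text's choice to keep the node feature dimension uniform rather than squeezing it.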

figure 2

Sample images of the banana dataset showing nutrition deficiencies of iron, calcium, and magnesium.

figure 3

Sample images of coffee nutrition deficiencies of boron, manganese, and nitrogen.

Classification module

Generally, regularization is a standard way to tackle training-related challenges of any network, such as overfitting. Here, layer normalization and dropout layers are interposed as regularization techniques for handling overfitting. Lastly, \({F}_{final}\) is passed through a softmax layer to compute the output probability of the predicted class label \(\bar{b}\) , corresponding to the actual label \(b \in Y\) of object classes Y .

The categorical cross-entropy loss function ( \(\mathscr {L}_{CE}\) ) and the stochastic gradient descent (SGD) optimizer with a \(10^{-3}\) learning rate have been chosen for the experiments:

\(\mathscr {L}_{CE} = -\sum _{i=1}^{N} Y_i \log \hat{Y}_i,\)

where \(Y_i\) is the actual class label, \(\hat{Y}_i\) is the class probability predicted by the softmax activation function \(\sigma (.)\) in the classification layer, and N is the total number of classes.
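The loss computation can be sketched as follows (a minimal NumPy sketch; the batch averaging and the epsilon guard against \(\log 0\) are implementation details assumed here):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred):
    """L_CE = -sum_i Y_i * log(Yhat_i), averaged over the batch.

    y_true holds one-hot labels; y_pred holds softmax probabilities.
    """
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

y_true = np.array([[1.0, 0.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1]])
loss = categorical_cross_entropy(y_true, y_pred)  # = -log(0.7)
```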

Results and performance analysis

At first, the implementation description is provided, followed by a summary of the datasets. The experiments have been conducted using conventional classification and cross-validation methods. The performances are evaluated using the standard well-known metrics: accuracy, precision, recall, and F1-score (Eq. 8 ):

\(\text {Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}, \quad \text {Precision}=\frac{TP}{TP+FP}, \quad \text {Recall}=\frac{TP}{TP+FN}, \quad \text {F1}=\frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision}+\text {Recall}},\)

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. However, accuracy is not a reliable assessment metric when the data distribution among the classes is imbalanced. To avoid such misleading evaluation, precision and recall are useful metrics, from which the F1-score is computed. These three metrics are widely used for evaluating predictive performance when classes are imbalanced. In addition, we have evaluated the performance using confusion matrices, which provide a reliable assessment of our model. The performances have been compared with existing methods, as discussed below.
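These metrics follow directly from the confusion counts, as sketched below (the counts used in the example are illustrative assumptions):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion counts (Eq. 8)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example with assumed counts: 80 TP, 90 TN, 10 FP, 20 FN.
acc, prec, rec, f1 = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```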

Implementation details

A concise description of the model development regarding hardware resources, software implementation, data distribution, evaluation protocols, and related details is furnished below for easier understanding.

Summary of convolutional network architectures

The Inception-V3, Xception, ResNet-50, and MobileNet-V2 backbone CNNs with pre-trained ImageNet weights are used for convolutional feature computation from the input images. The Inception module focuses on increasing network depth using 5 \(\times \) 5, 3 \(\times \) 3, and 1 \(\times \) 1 convolutions 58 . Later, the 5 \(\times \) 5 convolution was replaced by factorizing it into 3 \(\times \) 3 filters 59 . Afterward, the Inception module was further extended by decoupling the channel-wise and spatial correlations via point-wise and depth-wise separable convolutions, which are the building blocks of the Xception architecture 60 . The separable convolution applies a depth-wise convolution for spatial aggregation (3 \(\times \) 3 filters) followed by a point-wise convolution (1 \(\times \) 1 filters) for cross-channel aggregation into a single feature map. Xception is a three-fold architecture developed with depth-wise separable convolution layers and residual connections. The residual connection, a.k.a. shortcut connection, is the central idea of the deep residual learning framework, widely known as the ResNet architecture 61 . Residual learning represents an identity mapping through a shortcut connection, following a simple addition of the feature maps of previous layers rendered using 3 \(\times \) 3 and 1 \(\times \) 1 convolutions. This identity mapping incurs no additional computational overhead and still eases the degradation problem. In a similar fashion, MobileNet-V2 uses bottleneck separable convolutions with a 3 \(\times \) 3 kernel size and inverted residual connections 62 . It is a memory-efficient framework suitable for mobile devices.

These backbones are widely used in existing works on diverse image classification problems (e.g., human activity recognition, object classification, disease prediction, etc.) due to their superior architectural designs 63 , 64 at reasonable computational cost. Here, these backbones are used for a fair performance comparison with the state-of-the-art methods developed for plant nutrition and disease classification 65 . We have customized the top layers of the base CNNs for adding the GCN module without altering their inherent layer-wise building blocks or convolutional design, such as the kernel sizes, skip connections, output feature dimension, and other design parameters. The basic characteristics of these backbone CNNs are briefed in Table 1 . The network depth, model size, and parameter count increase accordingly due to the addition of GCN layers upon the base CNN, as evident in Table 1 .

Two GCN layers have been used with ReLU activation, and the feature size is the same as the base CNN’s output channel dimension. For example, the channel feature size of ResNet-50, Xception, and Inception-V3 is 2048, which is kept as the dimension of the GCN’s channel feature map. The adjacency matrix is developed by considering the overall spatial relation among different neighborhood regions as a complete graph. Therefore, each region is related to all other regions, even if they are far apart, which is helpful for capturing long-distance feature interactions and building a holistic feature representation via a complete graph structure. Batch normalization and a dropout rate of 0.3 are applied in the overall network design to reduce overfitting.

Data pre-processing and data splitting techniques

The basic pre-processing provided by the Keras applications for each backbone has been applied: the input images are converted from RGB to BGR, and then each color channel is zero-centered with respect to the ImageNet dataset, without any scaling. Data augmentation methods such as random rotation (± 25 \(\circ \) ), scaling (± 0.25), Gaussian blur, and random cropping to a 224 \(\times \) 224 image size from the 256 \(\times \) 256 input are applied on-the-fly for data diversity in the image samples.
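The random-crop step, for example, can be sketched as follows (only the cropping augmentation is sketched here in NumPy; in practice the rotation, scaling, and blur transforms would be chained in the same on-the-fly manner, and the fixed seed is an assumption for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility

def random_crop(image, size=224):
    """Random size x size crop from a larger input, applied on-the-fly."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]

crop = random_crop(np.zeros((256, 256, 3)))  # 256x256 input -> 224x224 crop
```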

We have maintained the same train-test split provided with the datasets, e.g., PlantDoc. However, the other plant datasets do not provide any specific image distribution. Thus, we have randomly divided those datasets into train and test samples following a 70:30 split ratio, as followed in several works. The details of the image distribution are provided in Table 3 . For cross-validation, we have randomly divided the training samples into training and validation sets with a 4:1 ratio, i.e., five-fold cross-validation in a disjoint manner, which is a standard technique adopted in other methods 66 . The test set remains unaltered for both evaluation schemes for a clear performance comparison. Finally, the average test accuracy of five executions on each dataset is reported here as the overall performance of the PND-Net.
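The splitting procedure can be sketched as follows (a minimal NumPy sketch; the dataset size and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed

def split_and_folds(n_samples, test_ratio=0.3, k=5):
    """70:30 train/test split, then k disjoint validation folds on the train set."""
    idx = rng.permutation(n_samples)
    n_test = int(n_samples * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    folds = np.array_split(train_idx, k)  # disjoint subsets for five-fold CV
    return train_idx, test_idx, folds

train_idx, test_idx, folds = split_and_folds(1000)
```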

A summary of the implementation, indicating the hardware and software environments, training hyper-parameters, data augmentations, and estimated training and inference times (milliseconds), is specified in Table 2 . Our model is trained with a mini-batch size of 12 for 150 epochs, and the learning rate is divided by 5 after 100 epochs. No other criterion, such as early stopping, has been followed. The proposed method is developed in TensorFlow 2.x using Python.
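The step schedule described above can be sketched as follows (a minimal sketch; whether the boundary epoch itself uses the reduced rate is an assumption):

```python
def learning_rate(epoch, base_lr=1e-3):
    """Step schedule: base LR 1e-3, divided by 5 after epoch 100."""
    return base_lr / 5 if epoch > 100 else base_lr
```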

Dataset description

figure 4

Sample images of potato leaves infected by bacteria, pests, and nematodes.

figure 5

Sample images of infected leaves of soybean, tomato, and bell pepper from the PlantDoc dataset.

The four plant datasets used in this work are summarized in Table 3 . These datasets are collected from public repositories such as Mendeley Data and Kaggle.

The Banana nutrition deficiency dataset represents healthy samples and the visual symptoms of deficiencies of Boron, Calcium, Iron, Magnesium, Potassium, Sulphur, and Zinc. Samples from this dataset are shown in Fig. 2 . More details are provided in Ref. 25 .

The Coffee nutrition deficiency dataset (CoLeaf-DB) 26 represents healthy samples, and the deficiency classes are Boron, Calcium, Iron, Manganese, Magnesium, Nitrogen, Potassium, Phosphorus, and others. Samples from this dataset are illustrated in Fig. 3 .

The Potato disease classes are Virus, Phytophthora, Pest, Nematode, Fungi, Bacteria, and healthy. Samples from this dataset are shown in Fig. 4 . The dataset is collected from the Mendeley 67 repository.

The PlantDoc is a realistic plant disease dataset 65 , comprising different disease classes of Apple, Tomato, Potato, Strawberry, Soybean, Raspberry, Grapes, Corn, Bell-pepper, and others. Examples are shown in Fig. 5 .

The Breast Cancer Histopathology Image Classification (BreakHis) 68 dataset with 40 \(\times \) and 100 \(\times \) magnifications contains eight classes: adenosis, fibroadenoma, phyllodes tumor, and tubular adenoma; and ductal carcinoma, lobular carcinoma, mucinous carcinoma, and papillary carcinoma. Samples from this dataset are exemplified in Fig. 6 .

The SIPaKMeD dataset 69 contains 4050 single-cell images and is useful for classifying cervical cells in Pap smear images, shown in Fig. 7 . It is categorized into five classes based on cytomorphological features.

figure 6

Sample images of the BreakHis-40 \(\times \) dataset.

figure 7

Sample images of the SIPaKMeD dataset.

Result analysis and performance comparison

A summary of the datasets with their data distribution and the baseline accuracy (%) achieved by the aforesaid base CNNs is briefed in Table 3 . The baseline model is developed using the pre-trained CNN backbones with ImageNet weights: a backbone CNN extracts the base output feature map, which is pooled by a global average pooling layer and classified with a softmax layer. Four backbone CNNs with different design characteristics have been used for generalizing our proposed method. The baseline accuracies are reasonable and consistent across the various datasets, as evident in Table 3 .

Two different evaluation strategies, i.e., general classification and k -fold cross-validation ( \(k=5\) ), have been experimented with. An average performance has been estimated from multiple executions on each dataset and reported here. The top-1 accuracies (%) of the proposed PND-Net, comprising two GCN layers with feature dimension 2048 on top of different backbone CNNs, are given in Table 4 . The overall performance of the PND-Net on all datasets is significantly improved over the baselines, which clearly shows the efficiency of the proposed method. In addition, the PND-Net model has been tested with five-fold cross-validation for a robust performance analysis (“Fivefold cross validation experiments”). These cross-validation results (Tables 5 , 6 and 7 ) on each dataset could be considered benchmark performances using several metrics. Our method has achieved state-of-the-art performances on these datasets for plant disease and nutrition deficiency recognition.

An experimental study has been carried out on two more public datasets for human medical image analysis: the BreakHis dataset with 40 \(\times \) and 100 \(\times \) magnifications 68 and the SIPaKMeD dataset 69 have been evaluated for generalization. The conventional classification results are given in Table 4 , and the cross-validation performances are provided in Tables 6 and 7 .

Fivefold cross validation experiments

The fivefold cross-validation experiments on the various datasets have been conducted to evaluate the performance of PND-Net using the ResNet-50 and Xception backbones, and the results are given in Tables 5 , 6 , and 7 . The actual train set of each dataset is divided into five disjoint subsets of images. In each experiment, four of the five subsets are used for training and the remaining one is used for validation. Finally, the average validation result over the five folds is reported.

The results of five-fold cross-validation on the potato leaf disease dataset are provided in Table 5 . The numbers of potato leaf images in each fold for training, validation, and testing are 1608, 402, and 869, respectively. The results using different metrics are computed, and the last row gives the average cross-validation performance on this dataset.

Likewise, the performance of five-fold cross-validation on the BreakHis-40 \(\times \) dataset is presented in Table 6 . In this experiment, the number of training samples in each fold is 1280 images, and the validation set contains the remaining 320 images. The test set contains 400 images, which remains the same as in the aforesaid experiments. Each of the five folds has been validated and tested on this test set. Lastly, the average result of five-fold cross-validation has been computed and is given in the last row of Table 6 .

A similar experimental set-up of five-fold cross-validation has been followed for the other datasets. The average performances of PND-Net on these datasets are provided in Table 7 . The average cross-validation results are better than those of the conventional classification approach on the potato disease (ResNet-50: 94.48%) and BreakHis-40 \(\times \) (ResNet-50: 97.10%) datasets. The reason could be that the variation of the validation set in each fold enhances the learning capacity of the model due to training data diversity. As a result, improved performances have been achieved on diverse datasets. The results are consistent with those of the conventional classification method on the other datasets, as described above. The overall performances on different datasets validate the generalization capability of the proposed PND-Net.

figure 8

Confusion matrices have been computed using the proposed PND-Net with ResNet-50 backbone on: ( a ) top-row: PlantDoc; ( b ) bottom-row: potato, coffee, and banana datasets.

figure 9

Confusion matrix on the BreakHis-40 \(\times \) dataset (left) and the Pap smear cell dataset (right) using the proposed PND-Net built upon the Xception backbone.

figure 10

The t-SNE plots on the Potato leaf dataset using PND-Net with ResNet-50 (left) and Inception-V3 (right).

figure 11

The Grad-CAM outputs on various datasets are shown, from left to right: nutrition deficiency, potato and corn diseases, and breast cancer. The top row shows the original images; the corresponding Grad-CAM images are shown in the bottom row.

Model complexity and visualization

The model parameters are computed in millions, as provided in Table 8 . The model parameters have been estimated for three cases: (a) the baseline, i.e., the backbone CNN only; and the output feature dimension of the GCN layers set to (b) 1024 and (c) 2048. The average computational time of PND-Net using ResNet-50 has been estimated: the training time is 15.4 ms per image, the inference time is 5.8 ms per image, and the model size is 122 MB (given in Table 2 ). The confusion matrices on the four plant datasets are shown in Fig. 8 , indicating the overall performance using ResNet-50. Also, the feature map distributions form clearly separated clusters in the t-SNE diagrams 70 for two backbone models on the potato leaf dataset, shown in Fig. 10 . The gradient-weighted class activation mapping (Grad-CAM) 71 is illustrated in Fig. 11 for visual explanations, which clearly show the discriminative regions of different images.

Performance comparison

The highest reported accuracies on Banana nutrition classification were 78.76% and 87.89% using the raw dataset and an augmented version of the original dataset, respectively 72 . In contrast, our method has attained 84.0% using the lightweight MobileNet-V2 and a best of 90.0% using ResNet-50 on the raw dataset, implying a significant improvement in accuracy on this dataset.

The performances of PND-Net on CoLeaf-DB (the coffee dataset) are very similar across backbones, and the best accuracy (90.54%) is attained by Xception. The performance differences with the other base CNNs are very small, implying consistent performance. The elementary result using ResNet-50 reported on this recent public dataset is 87.75% 26 . Thus, our method has set new benchmark results on CoLeaf-DB for further enhancement in the future. Likewise, the Potato Leaf Disease dataset is a new one 67 , collected from the Mendeley data source. We are the first to provide in-depth results on this realistic dataset acquired in an uncontrolled environment.

A deep learning method attained 81.53% accuracy using an Xception backbone and 78.34% using Inception-V3 on the PlantDoc dataset 73 . In contrast, our PND-Net has attained 84.30% accuracy using Xception and 81.0% using Inception-V3. This evinces that PND-Net is more effective in discriminating plant diseases compared with the best reported existing methods. Clearly, the proposed graph-based network (PND-Net) is capable of distinguishing different types of nutrition deficiencies and plant diseases with a higher success rate on real-world public datasets.

The BreakHis dataset has been studied for four-class and binary classification in several existing works. However, we have compared with works classifying into eight categories at the image level for a fair comparison. The top-1 accuracy attained using Xception is 94.83%, whereas the state-of-the-art accuracy on this dataset is 93.40 ± 1.8%, achieved using a hybrid harmonization technique 74 . An accuracy of 92.8 ± 2.1% was reported using a class structure-based deep CNN 75 . Our cross-validation results (ResNet-50: 97.10%) improve over the existing methods.

Several deep learning methods have been experimented with on the SIPaKMeD dataset. A CNN-based method achieved 95.35 ± 0.42% accuracy 69 , a PCA-based technique obtained 97.87% accuracy for 5-class classification 76 , 98.30% was achieved using Xception 77 , and 98.26% using a DarkNet-based exemplar pyramid deep model 78 . A GCN-based method reported 98.37 ± 0.57% accuracy 54 . A few more comparative results have been studied in Ref. 79 . In contrast, our method has achieved 98.98 ± 0.20% accuracy, and 99.10% test accuracy with cross-validation, using the Xception backbone on this dataset. The confusion matrices on both human disease datasets are shown in Fig. 9 . Overall, the rigorous experimental results imply that the proposed method has achieved state-of-the-art performances on different types of datasets representing plant nutrition deficiency, plant disease, and human disease classification.

Ablation study

An in-depth ablation study has been carried out to observe the efficacy of the key components of the PND-Net. Firstly, the significance of computing different local regions is studied. These fixed-size regional descriptors are combined to create a holistic representation of feature maps over the baseline features. Notably, the region pooling technique has improved overall performance on all the datasets; e.g., the gain is more than 12% on the Banana nutrition deficiency dataset using the ResNet-50 backbone. The results of this study are provided in Table 9 .

Afterward, a component-level study has been conducted by removing a module from the proposed PND-Net to observe the influence of each key component on performance. An ablation study depicting the significance of the spatial pyramid pooling (SPP) layer has been conducted, and the results are shown in Table 10 . As the selection of discriminatory information at multiple pyramidal scales is avoided, the model might overlook finer details that would have been captured at multiple scales by the SPP layer. This causes an obvious degradation of the capacity of the network architecture, which is evident from the performances. Thus, capturing multi-scale features is useful for selecting relevant features for effective learning of plant health conditions.

Next, the efficacious GCN modules are excluded from the network architecture, and then, experiments have been conducted with regional features selected by our composite pooling modules (i.e., regions + SPP) from upsampled high-level deep features of a base CNN. The results are provided in Table 11 .

figure 12

( a ) The performances of various formulations of the numbers of regions and spatial pyramid pooling feature vectors; ( b ) the performances of different channel-wise node feature sizes within the GCN layers’ activation and propagation in the proposed method using the ResNet-50 backbone.

It is evident that the GCN module indeed improves performance remarkably. In the case of the Banana dataset using the Xception backbone, the accuracy of PND-Net is 89.25%, whereas removing the GCN layers degrades the accuracy to 81.46%, implying a 7.79% drop. However, one GCN layer (Banana: 86.0%) does not suffice to render state-of-the-art performance on these plant datasets. The results of considering a one-layer GCN on all datasets are demonstrated in Table 12 . Indeed, two GCN layers are beneficial in enhancing performance over one GCN layer, as evident in the literature 22 . Hence, two GCN layers are included in the proposed PND-Net model architecture.

A comparative study on different numbers of regions and numbers of pyramid-pooled feature vectors using ResNet-50 is shown in Fig. 12 a, which clearly implies a gradual improvement in accuracy on the PlantDoc and Banana datasets. Lastly, the influence of different feature vector sizes in the GCN layer activations has been studied. In this study, channel dimensions of 1024 and 2048 have been chosen for building the graph structures using the ResNet-50 backbone, i.e., the same channel dimensions as considered in the PND-Net architecture. The results of such variations (Fig. 12 b) provide insightful implications about the performance of the GCN layers.

The performances of PND-Net with a GCN output feature vector size of 1024 are summarized in Table 13 . The results are very competitive with the GCN size of 2048. Thus, the model with a 1024 GCN feature size could be preferred considering the trade-off between model capacity and performance. The detailed experimental studies imply an overall performance boost on all datasets, and the proposed PND-Net achieves state-of-the-art results. In addition, new public datasets have been benchmarked for further enhancement.

However, other categories of images, such as high-resolution and hyperspectral images, have not been evaluated. One reason is the unavailability of such plant datasets for public research. Also, other data modalities, such as soil-sensor information, could be utilized for developing fusion-based approaches. Several existing ensemble methods use multiple backbones, which suffer from higher computational complexity. Though our method performs better than several existing works, the computational complexity of PND-Net regarding model parameters and size could be improved. The reason is the plugging of the GCN module upon the backbone CNN, which incurs more parameters. To address this challenge, the graph convolutional layer could be simplified to reduce the model complexity. In addition, more realistic agricultural datasets representing field conditions, such as occlusion, cluttered backgrounds, and lighting variations, could be developed. These limitations of the proposed PND-Net will be explored in the near future.

In this paper, a deep network called PND-Net has been proposed for plant nutrition deficiency recognition using a GCN module added on top of a CNN backbone. The performances have been evaluated on four image datasets representing plant nutrition deficiencies and leaf diseases. These datasets have recently been introduced publicly for assessment. The network has been generalized by building the deep network using four standard backbone CNNs, and the network architecture has been improved by incorporating pyramid pooling over region-pooled feature maps and feature propagation via a GCN. We are the first to evaluate these nutrition inadequacy datasets for monitoring plant health and growth. Our method has attained state-of-the-art performance on the PlantDoc dataset for plant disease recognition. We encourage researchers to pursue further enhancement on these public datasets for early-stage detection of plant abnormalities, essential for sustainable agricultural growth. Furthermore, experiments have been conducted on the BreakHis (40 \(\times \) and 100 \(\times \) magnifications) and SIPaKMeD datasets, which are suitable for human health diagnosis. The proposed PND-Net has attained enhanced performances on these datasets too. In the future, new deep learning methods would be developed for early-stage disease detection of plants and health monitoring with balanced nutrition using other data modalities and imaging techniques.

Data availability

The six datasets that support the findings of this work are available via the given links. The Nutrient Deficient of Banana Plant dataset 25 is collected from https://data.mendeley.com/datasets/7vpdrbdkd4/1 . The CoLeaf-DB dataset 26 for coffee leaf nutrition deficiency classification is available at https://data.mendeley.com/datasets/brfgw46wzb/1 . The Potato Leaf Disease Dataset 67 is available at https://data.mendeley.com/datasets/ptz377bwb8/1 . The PlantDoc dataset 65 is available at https://github.com/pratikkayal/PlantDoc-Dataset . The BreakHis dataset 68 is available at https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis/ , and can also be downloaded from https://data.mendeley.com/datasets/jxwvdwhpc2/1 . The original SIPaKMeD dataset 69 can be found at https://www.cs.uoi.gr/marina/sipakmed.html , and on Kaggle at https://www.kaggle.com/datasets/mohaliy2016/papsinglecell .

Jung, M. et al. Construction of deep learning-based disease detection model in plants. Sci. Rep. 13 , 7331 (2023).

Aiswarya, J., Mariammal, K. & Veerappan, K. Plant nutrient deficiency detection and classification-a review. In 2023 5th International Conference Inventive Research in Computing Applications (ICIRCA) . 796–802 (IEEE, 2023).

Yan, Q., Lin, X., Gong, W., Wu, C. & Chen, Y. Nutrient deficiency diagnosis of plants based on transfer learning and lightweight convolutional neural networks Mobilenetv3-large. In Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition . 26–33 (2022).

Sudhakar, M. & Priya, R. Computer vision based machine learning and deep learning approaches for identification of nutrient deficiency in crops: A survey. Nat. Environ. Pollut. Technol. 22 (2023).

Noon, S. K., Amjad, M., Qureshi, M. A. & Mannan, A. Use of deep learning techniques for identification of plant leaf stresses: A review. Sustain. Comput. Inform. Syst. 28 , 100443 (2020).

Waheed, H. et al. Deep learning based disease, pest pattern and nutritional deficiency detection system for “Zingiberaceae’’ crop. Agriculture 12 , 742 (2022).

Barbedo, J. G. A. Detection of nutrition deficiencies in plants using proximal images and machine learning: A review. Comput. Electron. Agric. 162 , 482–492 (2019).

Shadrach, F. D., Kandasamy, G., Neelakandan, S. & Lingaiah, T. B. Optimal transfer learning based nutrient deficiency classification model in ridge gourd ( Luffa acutangula ). Sci. Rep. 13 , 14108 (2023).

Sathyavani, R., JaganMohan, K. & Kalaavathi, B. Classification of nutrient deficiencies in rice crop using DenseNet-BC. Mater. Today Proc. 56 , 1783–1789 (2022).

Haris, S., Sai, K. S., Rani, N. S. et al. Nutrient deficiency detection in mobile captured guava plants using light weight deep convolutional neural networks. In 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC) . 1190–1193 (IEEE, 2023).

Munir, S., Seminar, K. B., Sukoco, H. et al. The application of smart and precision agriculture (SPA) for measuring leaf nitrogen content of oil palm in peat soil areas. In 2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE) . 650–655 (IEEE, 2023).

Lu, J., Peng, K., Wang, Q. & Sun, C. Lettuce plant trace-element-deficiency symptom identification via machine vision methods. Agriculture 13 , 1614 (2023).

Omer, S. M., Ghafoor, K. Z. & Askar, S. K. Lightweight improved YOLOv5 model for cucumber leaf disease and pest detection based on deep learning. In Signal, Image and Video Processing . 1–14 (2023).

Kumar, A. & Bhowmik, B. Automated rice leaf disease diagnosis using CNNs. In 2023 IEEE Region 10 Symposium (TENSYMP) . 1–6 (IEEE, 2023).

Senjaliya, H. et al. A comparative study on the modern deep learning architectures for predicting nutritional deficiency in rice plants. In 2023 IEEE IAS Global Conference on Emerging Technologies (GlobConET) . 1–6 (IEEE, 2023).

Ennaji, O., Vergutz, L. & El Allali, A. Machine learning in nutrient management: A review. Artif. Intell. Agric. (2023).

Rathnayake, D., Kumarasinghe, K., Rajapaksha, R. & Katuwawala, N. Green insight: A novel approach to detecting and classifying macro nutrient deficiencies in paddy leaves. In 2023 8th International Conference Information Technology Research (ICITR) . 1–6 (IEEE, 2023).

Asaari, M. S. M., Shamsudin, S. & Wen, L. J. Detection of plant stress condition with deep learning based detection models. In 2023 International Conference on Energy, Power, Environment, Control, and Computing (ICEPECC) . 1–5 (IEEE, 2023).

Tavanapong, W. et al. Artificial intelligence for colonoscopy: Past, present, and future. IEEE J. Biomed. Health Inform. 26 , 3950–3965 (2022).

Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (2017).

Zhang, S., Tong, H., Xu, J. & Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 6 , 1–23 (2019).

Bera, A., Wharton, Z., Liu, Y., Bessis, N. & Behera, A. SR-GNN: Spatial relation-aware graph neural network for fine-grained image categorization. IEEE Trans. Image Process. 31 , 6017–6031 (2022).

Qu, Z., Yao, T., Liu, X. & Wang, G. A graph convolutional network based on univariate neurodegeneration biomarker for Alzheimer’s disease diagnosis. IEEE J. Transl. Eng. Health Med. (2023).

Khlifi, M. K., Boulila, W. & Farah, I. R. Graph-based deep learning techniques for remote sensing applications: Techniques, taxonomy, and applications—A comprehensive review. Comput. Sci. Rev. 50 , 100596 (2023).

Sunitha, P., Uma, B., Channakeshava, S. & Babu, S. A fully labelled image dataset of banana leaves deficient in nutrients. Data Brief 48 , 109155 (2023).

Tuesta-Monteza, V. A., Mejia-Cabrera, H. I. & Arcila-Diaz, J. CoLeaf-DB: Peruvian coffee leaf images dataset for coffee leaf nutritional deficiencies detection and classification. Data Brief 48 , 109226 (2023).

Chungcharoen, T. et al. Machine learning-based prediction of nutritional status in oil palm leaves using proximal multispectral images. Comput. Electron. Agric. 198 , 107019 (2022).

Bhavya, T., Seggam, R. & Jatoth, R. K. Fertilizer recommendation for rice crop based on NPK nutrient deficiency using deep neural networks and random forest algorithm. In 2023 3rd International Conference on Artificial Intelligence and Signal Processing (AISP) . 1–5 (IEEE, 2023).

Dey, B., Haque, M. M. U., Khatun, R. & Ahmed, R. Comparative performance of four CNN-based deep learning variants in detecting Hispa pest, two fungal diseases, and npk deficiency symptoms of rice ( Oryza sativa ). Comput. Electron. Agric. 202 , 107340 (2022).

Cevallos, C., Ponce, H., Moya-Albor, E. & Brieva, J. Vision-based analysis on leaves of tomato crops for classifying nutrient deficiency using convolutional neural networks. In 2020 International Joint Conference on Neural Networks (IJCNN) . 1–7 (IEEE, 2020).

Espejo-Garcia, B., Malounas, I., Mylonas, N., Kasimati, A. & Fountas, S. Using Efficientnet and transfer learning for image-based diagnosis of nutrient deficiencies. Comput. Electron. Agric. 196 , 106868 (2022).

Wang, C., Ye, Y., Tian, Y. & Yu, Z. Classification of nutrient deficiency in rice based on cnn model with reinforcement learning augmentation. In 2021 International Symposium on Artificial Intelligence and its Application on Media (ISAIAM) . 107–111 (IEEE, 2021).

Bahtiar, A. R., Santoso, A. J., Juhariah, J. et al. Deep learning detected nutrient deficiency in chili plant. In 2020 8th International Conference on Information and Communication Technology (ICoICT) . 1–4 (IEEE, 2020).

Rahadiyan, D., Hartati, S., Nugroho, A. P. et al. Feature aggregation for nutrient deficiency identification in chili based on machine learning. Artif. Intell. Agric. (2023).

Aishwarya, M. & Reddy, P. Ensemble of CNN models for classification of groundnut plant leaf disease detection. Smart Agric. Technol. 100362 (2023).

Nadafzadeh, M. et al. Design, fabrication and evaluation of a robot for plant nutrient monitoring in greenhouse (case study: iron nutrient in spinach). Comput. Electron. Agric. 217 , 108579 (2024).

Desiderio, J. M. H., Tenorio, A. J. F. & Manlises, C. O. Health classification system of romaine lettuce plants in hydroponic setup using convolutional neural networks (CNN). In 2022 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET) . 1–6 (IEEE, 2022).

Costa, L., Kunwar, S., Ampatzidis, Y. & Albrecht, U. Determining leaf nutrient concentrations in citrus trees using UAV imagery and machine learning. Precis. Agric. 1–22 (2022).

Lanjewar, M. G. & Parab, J. S. CNN and transfer learning methods with augmentation for citrus leaf diseases detection using PaaS cloud on mobile. Multimed. Tools Appl. 1–26 (2023).

Lanjewar, M. G., Morajkar, P. P. Modified transfer learning frameworks to identify potato leaf diseases. Multimed. Tools Appl. 1–23 (2023).

Dissanayake, A. et al. Detection of diseases and nutrition in bell pepper. In 2023 5th International Conference on Advancements in Computing (ICAC) . 286–291 (IEEE, 2023).

Wu, Z., Jiang, F. & Cao, R. Research on recognition method of leaf diseases of woody fruit plants based on transfer learning. Sci. Rep. 12 , 15385 (2022).

Liu, H., Lv, H., Li, J., Liu, Y. & Deng, L. Research on maize disease identification methods in complex environments based on cascade networks and two-stage transfer learning. Sci. Rep. 12 , 18914 (2022).

Kukreja, V., Sharma, R., Vats, S. & Manwal, M. DeepLeaf: Revolutionizing rice disease detection and classification using convolutional neural networks and random forest hybrid model. In 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT) . 1–6 (IEEE, 2023).

Bezabih, Y. A., Salau, A. O., Abuhayi, B. M., Mussa, A. A. & Ayalew, A. M. CPD-CCNN: Classification of pepper disease using a concatenation of convolutional neural network models. Sci. Rep. 13 , 15581 (2023).

Article   ADS   CAS   Google Scholar  

Kini, A. S., Prema, K. & Pai, S. N. Early stage black pepper leaf disease prediction based on transfer learning using convnets. Sci. Rep. 14 , 1404 (2024).

Wu, Q. et al. A classification method for soybean leaf diseases based on an improved convnext model. Sci. Rep. 13 , 19141 (2023).

Ma, X., Chen, W. & Xu, Y. ERCP-Net: A channel extension residual structure and adaptive channel attention mechanism for plant leaf disease classification network. Sci. Rep. 14 , 4221 (2024).

Babatunde, R. S. et al. A novel smartphone application for early detection of habanero disease. Sci. Rep. 14 , 1423 (2024).

Nagasubramanian, G. et al. Ensemble classification and IoT-based pattern recognition for crop disease monitoring system. IEEE Internet Things J. 8 , 12847–12854 (2021).

Nachtigall, L. G., Araujo, R. M. & Nachtigall, G. R. Classification of apple tree disorders using convolutional neural networks. In 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI) . 472–476 (IEEE, 2016).

Borhani, Y., Khoramdel, J. & Najafi, E. A deep learning based approach for automated plant disease classification using vision transformer. Sci. Rep. 12 , 11554 (2022).

Aishwarya, M. & Reddy, A. P. Dataset of groundnut plant leaf images for classification and detection. Data Brief 48 , 109185 (2023).

Shi, J. et al. Cervical cell classification with graph convolutional network. Comput. Methods Prog. Biomed. 198 , 105807 (2021).

Fahad, N. M., Azam, S., Montaha, S. & Mukta, M. S. H. Enhancing cervical cancer diagnosis with graph convolution network: AI-powered segmentation, feature analysis, and classification for early detection. Multimed. Tools Appl. 1–25 (2024).

Lanjewar, M. G., Panchbhai, K. G. & Patle, L. B. Fusion of transfer learning models with LSTM for detection of breast cancer using ultrasound images. Comput. Biol. Med. 169 , 107914 (2024).

Article   CAS   PubMed   Google Scholar  

He, K., Zhang, X., Ren, S. & Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37 , 1904–1916 (2015).

Article   PubMed   Google Scholar  

Szegedy, C. et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . 1–9 (2015).

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . 2818–2826 (2016).

Chollet, F. Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision Pattern Recognition . 1251–1258 (2017).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition . 770–778 (2016).

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L.-C. MobileNetv2: Inverted residuals and linear bottlenecks. In Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition . 4510–4520 (2018).

Bera, A., Nasipuri, M., Krejcar, O. & Bhattacharjee, D. Fine-grained sports, yoga, and dance postures recognition: A benchmark analysis. IEEE Trans. Instrum. Meas. 72 , 1–13 (2023).

Bera, A., Wharton, Z., Liu, Y., Bessis, N. & Behera, A. Attend and guide (AG-Net): A keypoints-driven attention-based deep network for image recognition. IEEE Trans. Image Process. 30 , 3691–3704 (2021).

Article   ADS   PubMed   Google Scholar  

Singh, D. et al. PlantDoc: A dataset for visual plant disease detection. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD . 249–253 (ACM, 2020).

Hameed, Z., Garcia-Zapirain, B., Aguirre, J. J. & Isaza-Ruget, M. A. Multiclass classification of breast cancer histopathology images using multilevel features of deep convolutional neural network. Sci. Rep. 12 , 15600 (2022).

Shabrina, N. H. et al. A novel dataset of potato leaf disease in uncontrolled environment. Data Brief 52 , 109955 (2024).

Spanhol, F. A., Oliveira, L. S., Petitjean, C. & Heutte, L. A dataset for breast cancer histopathological image classification. IEEE Trans. Biomed. Eng. 63 , 1455–1462 (2015).

Plissiti, M. E. et al. SIPAKMED: A new dataset for feature and image based classification of normal and pathological cervical cells in Pap smear images. In 2018 25th IEEE International Conf. Image Processing (ICIP) . 3144–3148 (IEEE, 2018).

Van Der Maaten, L. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15 , 3221–3245 (2014).

MathSciNet   Google Scholar  

Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV) . 618–626 (2017).

Han, K. A. M., Maneerat, N., Sepsirisuk, K. & Hamamoto, K. Banana plant nutrient deficiencies identification using deep learning. In 2023 9th International Conference on Engineering, Applied Sciences, and Technology (ICEAST) . 5–9 (IEEE, 2023).

Ahmad, A., El Gamal, A. & Saraswat, D. Toward generalization of deep learning-based plant disease identification under controlled and field conditions. IEEE Access 11 , 9042–9057 (2023).

Abdallah, N. et al. Enhancing histopathological image classification of invasive ductal carcinoma using hybrid harmonization techniques. Sci. Rep. 13 , 20014 (2023).

Han, Z. et al. Breast cancer multi-classification from histopathological images with structured deep learning model. Sci. Rep. 7 , 4172 (2017).

Article   ADS   PubMed   PubMed Central   Google Scholar  

Basak, H., Kundu, R., Chakraborty, S. & Das, N. Cervical cytology classification using PCA and GWO enhanced deep features selection. SN Comput. Sci. 2 , 369 (2021).

Mohammed, M. A., Abdurahman, F. & Ayalew, Y. A. Single-cell conventional pap smear image classification using pre-trained deep neural network architectures. BMC Biomed. Eng. 3 , 11 (2021).

Yaman, O. & Tuncer, T. Exemplar pyramid deep feature extraction based cervical cancer image classification model using pap-smear images. Biomed. Signal Process. Control 73 , 103428 (2022).

Jiang, H. et al. Deep learning for computational cytology: A survey. Med. Image Anal. 84 , 102691 (2023).

Download references

Acknowledgements

This work is supported by the New Faculty Seed Grant (NFSG) and Cross-Disciplinary Research Framework (CDRF: C1/23/168) projects, Open Access facilities, and necessary computational infrastructure at the Birla Institute of Technology and Science (BITS) Pilani, Pilani Campus, Rajasthan, 333031, India.

Author information

Authors and Affiliations

Department of Computer Science and Information Systems, BITS Pilani, Pilani Campus, Pilani, Rajasthan, 333031, India

Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, 700032, India

Debotosh Bhattacharjee

Faculty of Informatics and Management, University of Hradec Kralove, Hradec Kralove, Czech Republic

Debotosh Bhattacharjee & Ondrej Krejcar

Skoda Auto University, Na Karmeli 1457, 293 01, Mlada Boleslav, Czech Republic

Ondrej Krejcar

Malaysia Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia


Contributions

A.B. played a pivotal role in this research, contributing to model development, coding, generation of results, and preparation of the initial manuscript. D.B. and O.K. reviewed the work meticulously, validated the model, and revised the manuscript to improve the clarity of the text and the overall organization of the article. D.B. also read the manuscript carefully and provided valuable input for improving its overall quality.

Corresponding author

Correspondence to Asish Bera .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Bera, A., Bhattacharjee, D. & Krejcar, O. PND-Net: plant nutrition deficiency and disease classification using graph convolutional network. Sci Rep 14 , 15537 (2024). https://doi.org/10.1038/s41598-024-66543-7


Received : 25 March 2024

Accepted : 02 July 2024

Published : 05 July 2024

DOI : https://doi.org/10.1038/s41598-024-66543-7


  • Agriculture
  • Convolutional neural network
  • Graph convolutional network
  • Plant disease
  • Nutrition deficiency
  • Cancer classification
  • Spatial pyramid pooling


ORIGINAL RESEARCH article

GIHP: graph convolutional neural network-based interpretable pan-specific HLA-peptide binding affinity prediction

Lingtao Su

  • 1 Shandong University of Science and Technology, Qingdao, China
  • 2 Shandong Guohe Industrial Technology Research Institute Co. Ltd., Jinan, China
  • 3 Qingdao UNIC Information Technology Co. Ltd., Qingdao, China

Accurately predicting the binding affinities between Human Leukocyte Antigen (HLA) molecules and peptides is a crucial step in understanding the adaptive immune response. This knowledge can have important implications for the development of effective vaccines and the design of targeted immunotherapies. Existing sequence-based methods are insufficient to capture structural information. In addition, current methods lack interpretability, which hinders identification of the key binding amino acids between the two molecules. To address these limitations, we propose an interpretable graph convolutional neural network (GCNN)-based prediction method named GIHP. Considering the size difference between HLA molecules and short peptides, GIHP represents the HLA structure as an amino-acid-level graph and the peptide SMILES string as an atom-level graph. For interpretation, we design a novel visual explanation method, gradient-weighted activation mapping (Grad-WAM), to identify key binding residues. GIHP achieved better prediction accuracy than state-of-the-art methods across various datasets. According to current research findings, mutations of key HLA-peptide binding residues directly impact immunotherapy efficacy. We therefore examined whether the highlighted key residues can significantly distinguish immunotherapy patient groups, and verified that the identified functional residues successfully separate patient survival groups across breast, bladder, and pan-cancer datasets. These results demonstrate that GIHP improves both the accuracy and the interpretability of HLA-peptide prediction, and the findings of this study can be used to guide personalized cancer immunotherapy. Codes and datasets are publicly accessible at: https://github.com/sdustSu/GIHP .

1 Introduction

HLA molecules, also known as MHC (major histocompatibility complex) molecules, are responsible for presenting peptides derived from intracellular or extracellular proteins to T cells. Predicting their peptide binding is a crucial step in understanding and predicting immune responses such as antigen presentation and T-cell activation ( Kallingal et al., 2023 ). HLA molecules are classified into two major classes, class I and class II. Each class has different subtypes, and their binding abilities vary depending on the specific HLA subtype. The binding groove of HLA class I is closed at both ends, restricting bound peptides to 8–12 residues, whereas the open groove of HLA class II accommodates peptides of 13–25 residues ( Wang and Claesson, 2014 ). Accordingly, existing methods can be classified into allele-specific and pan-specific methods. Allele-specific methods focus on predicting the binding affinity between peptides and a specific HLA allele, whereas pan-specific methods aim to predict HLA-peptide binding in a more general way, without the need for allele-specific training data ( Gizinski et al., 2024 ).

Allele-specific methods train a separate model for each MHC allele and make predictions for individual alleles. NetMHC ( Lundegaard et al., 2008 ) is a widely used allele-specific method that applies machine learning to learn the relationship between peptide sequences and their binding affinities to specific MHC alleles. NetMHC 4.0 ( Andreatta and Nielsen, 2016 ) is also a sequence-based allele-specific method; it uses both BLOSUM62 and sparse encoding schemes to encode peptide sequences into nine-amino-acid binding cores. Compared with HLA molecules (around 360 aa in length), peptides are much shorter, so such methods must use insertion schemes to reconcile or extend the original sequence. Deep learning-based methods have also been developed for MHC-peptide binding prediction. DeepMHCII ( You et al., 2022 ) utilizes deep convolutional neural networks (CNNs) to capture complex sequence patterns and interactions between peptides and MHC class II molecules. It takes the peptide and MHC protein sequences as input and uses multiple layers of convolutional filters to extract features from the sequences; these filters scan the input at different lengths, capturing both local and global patterns, and the extracted features are fed into fully connected layers to predict binding affinity. MHCAttnNet ( Venkatesh et al., 2020 ) combines a bidirectional long short-term memory (Bi-LSTM) network with attention mechanisms to capture important features and dependencies in MHC-peptide interactions. The Bi-LSTM processes the sequences in both forward and backward directions, capturing the dependencies and context in the data, while the attention mechanism allows the model to weight different parts of the input sequences by their relative importance. This enables the model to focus on the most relevant regions of the peptide and MHC sequences during prediction.
SMM-align ( Nielsen et al., 2007 ) utilizes structural and sequence-based features to predict binding affinities for MHC class I alleles. It employs a PSSM alignment algorithm to align target peptide sequences with known binders and derive binding predictions. MHC-NP ( Giguere et al., 2013 ) also combines structural and sequence-based features and employs a random forest regression model to make predictions. Allele-specific methods are particularly useful when the focus is on specific alleles of interest, allowing more accurate predictions tailored to those alleles. However, developing and maintaining a separate model for each allele requires a significant amount of experimental binding data and computational resources.
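The attention pooling described above can be sketched numerically as follows. This is a minimal illustration, not MHCAttnNet's actual architecture: the query vector, dimensions, and function names are our own assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(hidden, query):
    """Score each sequence position against a query vector, then return
    the softmax-weighted sum of position encodings plus the weights."""
    scores = hidden @ query      # one scalar score per sequence position
    weights = softmax(scores)    # positions compete for importance
    return weights @ hidden, weights

rng = np.random.default_rng(1)
hidden = rng.normal(size=(9, 6))   # e.g., Bi-LSTM outputs for a 9-mer peptide
context, weights = attention_pool(hidden, rng.normal(size=6))
```

The `weights` vector makes the mechanism inspectable: positions with high weight are the ones the model "attends" to when forming the pooled `context` representation.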

On the other hand, pan-specific methods can predict binding affinities not only for alleles present in the training data but also for new, unseen alleles. NetMHCpan and NetMHCIIpan ( Reynisson et al., 2020 ) are widely used pan-specific methods. They take sequence features as input and use artificial neural networks (ANNs) to learn the relationship between peptide sequences and their binding affinities to MHCs, considering various sequence-based features including amino acid composition, physicochemical properties, and binding motifs. In comparison with these two methods, MHCflurry ( O'Donnell et al., 2018 ; O'Donnell et al., 2020 ) integrates additional information, such as peptide processing predictions and binding affinity measurements from mass spectrometry-based experiments, to enhance its predictions. Some sequence-based methods, such as BERTMHC ( Cheng et al., 2021 ), leverage the BERT language model. BERT is pre-trained on a vast corpus of text data, which enables it to capture intricate patterns and dependencies within input sequences; one advantage of using it to encode peptide sequences is its ability to capture long-range dependencies and contextual information. This is particularly important in MHC binding prediction, where specific amino acid positions within a peptide can significantly affect binding affinity. Because structure determines protein function, some methods also incorporate structural information into their predictions. MixMHCpred-2.0.1 ( Gfeller et al., 2018 ) employs a deep learning architecture capable of learning complex patterns and relationships between peptide sequences and MHC binding affinities. The model is trained on a diverse set of MHC alleles and covers a wide range of peptide lengths.
This allows it to make accurate predictions for a broad range of MHC-peptide combinations. NetMHCpan-4.0 ( Jurtz et al., 2017 ) utilizes a combination of structural and sequence-based features: it incorporates information from MHC-peptide complex structures and uses a machine learning approach to make pan-specific predictions. RPEMHC ( Wang et al., 2024 ) is a deep learning approach that aims to improve MHC-peptide binding affinity prediction by utilizing a residue-residue pair encoding scheme; in RPEMHC, the peptide sequence and MHC binding groove are encoded as one-hot vectors representing each amino acid residue and its position. AutoDock is a widely used molecular docking software package that can also be employed for MHC-peptide binding prediction. It uses a Lamarckian genetic algorithm to explore the conformational space and predict the binding modes and affinities of peptides within the MHC binding groove. By modelling the docking between the HLA protein and peptide ligands, such methods have achieved accurate binding predictions. However, docking relies on sampling different conformations of the peptide and MHC molecule to find the best binding pose, and because the conformational space of peptides and MHC molecules is vast, exhaustively sampling all possible conformations is computationally infeasible.
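A one-hot residue encoding of the kind mentioned above can be sketched as follows. The function name and the example peptide are our own; RPEMHC's actual encoding additionally pairs residues, which is not shown here.

```python
# The 20 standard amino acids, in a fixed order defining the one-hot columns.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_peptide(seq):
    """Encode a peptide as a (length x 20) one-hot matrix: each row has a
    single 1 in the column of that position's amino acid."""
    return [[1 if aa == residue else 0 for aa in AMINO_ACIDS]
            for residue in seq.upper()]

enc = one_hot_peptide("SIINFEKL")  # an 8-mer; one row per residue
```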

In fact, whether allele-specific or pan-specific, all of these methods can be broadly divided into two categories: sequence-based and structure-based. Sequence-based methods utilize machine learning techniques to capture the sequence motifs and physicochemical properties important for HLA-peptide binding. These methods employ algorithms such as support vector machines (SVMs), random forests, or ANNs to learn the relationships between peptide sequences and binding affinities from large datasets, and have the advantage of being computationally efficient and applicable to a wide range of HLA alleles and peptides. Structure-based methods leverage the three-dimensional structures of HLA molecules and peptides to predict binding affinities. Molecular docking algorithms, such as AutoDock, are commonly used to explore the conformational space and calculate binding energies. These methods require knowledge of the 3D structures of the HLA molecule and peptide, limiting their applicability when experimental structures are unavailable. Recent advances in deep learning, such as CNNs and recurrent neural networks (RNNs), have shown promise in HLA-peptide binding affinity prediction. Deep learning-based methods can effectively capture complex sequence patterns and structural features, leading to improved prediction accuracy ( Wang et al., 2023 ). These models often incorporate encoding schemes to represent peptide sequences or structural features and are trained on large datasets to learn the relationships between sequences and binding affinities. Despite notable progress, HLA-peptide binding affinity prediction still faces challenges. First, deep learning models are often black boxes that lack interpretability: it can be challenging to understand the specific features or patterns that contribute to a model's predictions.
Interpretability is crucial in immunology research to gain insight into the molecular mechanisms underlying MHC-peptide interactions and to guide experimental studies. Second, existing methods often rely on sequence-based encoding schemes owing to the limited availability of experimentally determined 3D structures for HLA-peptide complexes. While sequence information is informative, excluding structural details may limit the accuracy and coverage of predictions, particularly where structural features play a crucial role; even the tools that do consider structural information seldom model it at the amino-acid level. In addition, the length difference between the peptides that HLA can bind (typically around 8–15 amino acids) and the HLA molecules themselves (over 360 amino acids) poses a challenge for binding affinity prediction. Furthermore, unlike HLAs, peptides are too short to form stable structures. None of these drawbacks is well addressed by existing methods.

Considering all these limitations, we propose GIHP, an interpretable GCNN-based algorithm for pan-specific prediction of peptide binding to HLA molecules. By representing peptide SMILES strings ( Quiros et al., 2018 ; Meng et al., 2024 ) and HLA structures as attributed graphs, GCNNs can effectively model the pairwise interactions between amino acids and capture both local and global structural features. Furthermore, GIHP includes a novel visual explanation method, Grad-WAM, for interpreting HLA-peptide binding affinity predictions. By analyzing the learned representations and interactions within the graph structure, Grad-WAM can identify the key residues that contribute most significantly to the HLA-peptide binding process. Comprehensive comparative evaluations demonstrate that GIHP performs well across diverse benchmark datasets. By applying the GIHP framework to several cancer immunotherapy datasets, we identified numerous promising biomarkers that effectively distinguish patients with and without treatment response. Moving forward, the insights gained from GIHP analysis can be leveraged to guide the development of more personalized cancer immunotherapy strategies.
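The text does not spell out the Grad-WAM formula at this point, but a Grad-CAM-style weighting adapted to graph nodes — channel weights from gradients averaged over nodes, then a ReLU-weighted channel sum per node — might look like this sketch. All names, values, and the exact formula here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def grad_wam_scores(activations, gradients):
    """Grad-CAM-style node importance for a graph layer: average the
    gradients over nodes to get one weight per channel, then take a
    ReLU of the weighted channel sum at each node and normalize to [0, 1]."""
    channel_weights = gradients.mean(axis=0)               # one weight per channel
    scores = np.maximum(activations @ channel_weights, 0.0)  # per-node importance
    if scores.max() > 0:
        scores = scores / scores.max()
    return scores

# Toy last-layer activations (3 residues x 2 channels) and their gradients
# with respect to the predicted affinity.
acts = np.array([[0.2, 1.0], [0.9, 0.1], [0.0, 0.3]])
grads = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])
residue_importance = grad_wam_scores(acts, grads)
```

Residues whose score is near 1 would be flagged as candidate key binding residues.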

2 Materials and methods

2.1 Data collection and processing

We collected human HLA-peptide interaction datasets from published papers and publicly available databases ( Table 1 ).


Table 1 . Summary of the collected datasets after preprocessing.

Wang-2008 Dataset ( Wang et al., 2008 ): experimentally measured peptide binding affinities for HLA class II molecules. The processed dataset has 24,295 interaction entries in total, with ligand lengths ranging from 16 to 37 and 26 unique HLA molecules; HLA-DP and HLA-DQ molecules are covered.

Wang-2010 Dataset ( Wang et al., 2010 ): experimentally measured peptide binding affinities for MHC class II molecules. After preprocessing, the dataset contains 9,478 measured affinities and covers 14 MHC class II alleles, with peptide lengths ranging from 9 to 37.

Kim-2014 Dataset ( Kim et al., 2014 ): this dataset was obtained from the Immune Epitope Database (IEDB) ( Vita et al., 2019 ) and includes binding affinity data compiled in 2009 (BD 2009) and 2013 (BD 2013), as well as a blind dataset obtained by subtracting BD 2009 from BD 2013. For all three subsets, only human data were kept for training. After preprocessing, the dataset contains 268,189 interactions in total, with peptide lengths ranging from 8 to 30.

Jurtz-2017 Dataset ( Jurtz et al., 2017 ): this dataset was originally designed for training NetMHCpan-4.0. The final processed dataset has 3,618,591 entries in total, with ligand lengths ranging from 8 to 18.

Jensen-2018 Dataset ( Jensen et al., 2018 ): this dataset was used for training NetMHCIIpan-3.2 ( Karosiene et al., 2013 ) and contains HLA class II binding affinities retrieved from the IEDB in 2016. It comprises 131,008 data points, covering 36 HLA-DR, 27 HLA-DQ, and 9 HLA-DP molecules and 15,965 unique peptides, with peptide lengths ranging from 9 to 33.

Zhao-2018 Dataset ( Zhao and Sher, 2018 ): this dataset was compiled for training IEDB tools as well as MHCflurry ( O'Donnell et al., 2018 ). It contains 21,092 binding relations, covering 18 HLA-DR, 19 HLA-DQ, and 16 HLA-DP molecules and 2,168 unique peptides; all peptides are of length 15.

Reynisson-2020 Dataset ( Reynisson et al., 2020 ): this dataset was originally collected for training NetMHCpan-4.1 and NetMHCIIpan-4.0. It covers 161 distinct HLA class I molecules and 4,523,148 distinct peptides, with peptide lengths ranging from 8 to 15.

For all the collected training datasets, only binding affinity values in IC50 (nM) format are kept; these are log-transformed to the range between 0 and 1 by applying 1 − log(IC50 nM)/log(50,000), as explained by Nielsen et al. (2003) . When classifying peptides into binders and non-binders, a threshold of 500 nM is used, meaning that peptides with log50k-transformed binding affinity values greater than 0.426 are classified as binders. We consolidated all the collected datasets, removing duplicate entries, to arrive at a final integrated dataset comprising 160,253 unique HLA-peptide interactions, covering 223 distinct HLA alleles and 35,481 peptide sequences. To further verify the generality of our method, we collected protein-peptide binding data from the pepBDB ( Wen et al., 2019 ) database; after deleting peptides shorter than 8 aa, we obtained 12,655 interactions between 11,055 proteins and 7,811 peptides. Because our method takes HLA and protein structures as input, all structure data were downloaded from the PDB ( Berman et al., 2000 ) and the AlphaFold database ( Varadi et al., 2022 ), and some were predicted by AlphaFold2 ( Jumper et al., 2021 ) and RoseTTAFold ( Baek et al., 2021 ). Only high-resolution experimental structures (e.g., X-ray crystallography or cryo-EM data with resolution better than 3.0 Å) were included. All structural models, whether experimental or predicted, were subjected to validation using atomic contact evaluation and overall model quality assessment; only structures that passed these validation checks were retained for further analyses.
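The log-transform and binder threshold above can be sketched as follows (a minimal illustration; the function names are ours):

```python
import math

def transform_ic50(ic50_nm):
    """Map an IC50 value in nM to [0, 1] via 1 - log(IC50)/log(50000),
    so stronger binders (lower IC50) get values closer to 1."""
    return 1.0 - math.log(ic50_nm) / math.log(50000.0)

def is_binder(ic50_nm, threshold_nm=500.0):
    """Peptides binding more strongly than 500 nM count as binders."""
    return ic50_nm < threshold_nm

# The 500 nM classification threshold corresponds to a transformed value
# of about 0.426, matching the cutoff quoted in the text.
half_micromolar = transform_ic50(500.0)
```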

To evaluate whether the key binding residues identified by our method can effectively differentiate patients who benefit from immunotherapy, we collected relevant breast, bladder, and pan-cancer treatment datasets from the cBioPortal resource ( Cerami et al., 2012 ), as shown in Table 2 . Mutation of a key binding residue can change the binding affinity between HLA and peptides, and binding affinity change has been demonstrated to be a biomarker of immunotherapy efficacy ( Kim et al., 2020 ; Seidel et al., 2021 ; Murata et al., 2022 ). For each patient, only SNP mutations were kept; if a SNP falls on a key binding site of the HLA or peptide, the patient is placed in one group, and otherwise in the other group. We then conduct survival analysis for the two groups.
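This patient-grouping step can be sketched as follows. The patient IDs and SNP positions are hypothetical, purely for illustration:

```python
def split_by_key_sites(patient_snps, key_sites):
    """Place patients with at least one SNP on a key binding residue in one
    group and all remaining patients in the other, for survival analysis."""
    key_sites = set(key_sites)
    on_site, off_site = [], []
    for patient, positions in patient_snps.items():
        (on_site if key_sites & set(positions) else off_site).append(patient)
    return on_site, off_site

# Hypothetical patients mapped to the residue positions of their SNPs.
patients = {"P1": [45, 97], "P2": [12], "P3": [97, 160]}
mutated, unmutated = split_by_key_sites(patients, key_sites=[97, 114])
```

The two returned lists define the cohorts whose survival curves are then compared.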


Table 2 . Immunotherapy related dataset and three cancer datasets.

Samstein-2019 dataset ( Samstein et al., 2019 ): The cohort consisted of 1,662 patients who received at least one dose of immune checkpoint inhibitor (ICI) therapy, encompassing a variety of cancer types with an adequate number of patients for analysis. In detail, 146 patients received anti-CTLA4, 1,447 received anti-PD1 or anti-PD-L1, and 189 received both. This is a pan-cancer dataset, including 350 cases of non-small cell lung cancer (NSCLC), 321 cases of melanoma, 151 cases of renal cell carcinoma (RCC), 214 cases of bladder cancer, and 138 cases of head and neck squamous cell cancer.

Miao-2018 dataset ( Miao et al., 2018 ): This dataset consists of 249 patient tumors from six different cancer types: melanoma (N = 151), non-small cell lung cancer (N = 57), bladder cancer (N = 27), head and neck squamous cell carcinoma (N = 12), anal cancer (N = 1), and sarcoma (N = 1). These patients were treated with anti-PD-1 therapy (N = 74), anti-PD-L1 therapy (N = 20), anti-CTLA-4 therapy (N = 145), or a combination of anti-CTLA-4 and anti-PD-1/L1 therapies (N = 10). A small proportion of patients (N = 7) received a combination of anti-PD-1, anti-PD-L1, or anti-CTLA-4 therapy with another immunotherapy, targeted therapy, or cytotoxic chemotherapy.

Razavi-2018 dataset ( Razavi et al., 2018 ): This dataset is downloaded from cBioPortal: https://cbioportal-datahub.s3.amazonaws.com/breast_msk_2018.tar.gz .

Clinton-2022 dataset ( Clinton et al., 2022 ): This dataset is downloaded from cBioPortal: https://cbioportal-datahub.s3.amazonaws.com/paired_bladder_2022.tar.gz .

Aaltonen-2020 dataset ( Consortium et al., 2020 ): This dataset is downloaded from cBioPortal: https://cbioportal-datahub.s3.amazonaws.com/pancan_pcawg_2020.tar.gz .

2.2 Methods

The overall framework of GIHP is illustrated in Figure 1 . GIHP takes an HLA structure and a peptide SMILES string as input. In the input representation module, the HLA is represented as an attributed residue-level graph, while the peptide is represented as an attributed atom-level graph. Multi-layer GCNNs are then used to learn high-level features, and the learned features are concatenated and fed into an MLP layer for the final binding affinity prediction. To enhance the interpretability of the results, we introduce a novel visual interpretation method called Grad-WAM, which leverages gradient information from the last GCN layer to assess the contribution of each neuron to the predicted affinity.


Figure 1 . The overall framework of GIHP.

2.2.1 Input representation

Graph-based protein structure representation has inherent advantages over traditional sequence-based approaches in capturing true binding events. For each HLA molecule, we take both structure and sequence information into consideration. Given that one of our key objectives is to identify the critical binding amino acid residues, we represent the HLA proteins as residue-level relational graphs G_H = (V, E), where V is the set of amino acid nodes and E is the set of edges. As shown in Table 3 , the node attributes integrate sequence and structural properties, including amino acid type, chemical properties, charges, etc., while the edge attributes encompass connection types, distances, and structural information. We consider four types of bond edges: peptide bonds, hydrogen bonds, ionic bonds, and disulfide bridges.
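As an illustration of this representation, the sketch below builds a small residue-level graph in plain Python. The feature names, the contact list, and the peptide-bond distance are placeholders standing in for the attributes listed in Table 3; this is not the authors' actual schema.

```python
# Illustrative sketch only: a residue-level graph for an HLA chain.
# Feature names and values are assumptions, not the paper's schema.
HYDROPHOBIC = set("AVLIMFWC")
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1, "H": 1}

def build_residue_graph(sequence, contacts):
    """sequence: amino-acid string; contacts: iterable of
    (i, j, bond_type, distance) tuples derived from the 3D structure."""
    nodes = [
        {
            "residue": aa,
            "hydrophobic": aa in HYDROPHOBIC,
            "charge": CHARGE.get(aa, 0),
        }
        for aa in sequence
    ]
    # Peptide bonds always connect sequential residues along the backbone
    # (the C-N peptide bond is roughly 1.3 Å long).
    edges = [
        {"i": i, "j": i + 1, "type": "peptide_bond", "distance": 1.3}
        for i in range(len(sequence) - 1)
    ]
    # Non-covalent / structural edges: hydrogen bonds, ionic bonds,
    # disulfide bridges, taken from the structure.
    for i, j, bond_type, dist in contacts:
        edges.append({"i": i, "j": j, "type": bond_type, "distance": dist})
    return nodes, edges

nodes, edges = build_residue_graph("GSHSMRY", [(0, 5, "hydrogen_bond", 2.9)])
```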


Table 3 . The node features of HLA graph.

Peptides binding to MHC class II are between 13 and 25 residues long, while those binding to MHC class I are around nine residues long; peptides are therefore short relative to HLAs (over 360 aa). In this study, we represent peptides as SMILES strings and then transform them into graphs using a molecular graph representation method based on RDKit ( https://www.rdkit.org ). The attributes of each node v_i are shown in Table 4 . Each edge e_ij ∈ E represents a covalent bond between the ith and jth atoms; the edge attribute depends on the number of electron pairs shared between the atoms, giving single, double, or triple bonds.
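To make the atom-level view concrete, here is a minimal, self-contained SMILES-to-graph conversion. It covers only a small subset of SMILES (C/N/O/S atoms, single/double/triple bonds, branches) for illustration; the paper itself relies on RDKit, which handles the full grammar.

```python
def smiles_to_graph(smiles):
    """Convert a simple SMILES string into (atoms, bonds) lists.
    Minimal illustration only: atoms C/N/O/S, bond symbols =/#,
    and parenthesised branches. Real work should use RDKit."""
    atoms, bonds = [], []
    prev, order, stack = None, 1, []
    for ch in smiles:
        if ch in "CNOS":                          # organic-subset atom
            idx = len(atoms)
            atoms.append(ch)
            if prev is not None:
                bonds.append((prev, idx, order))  # (atom_i, atom_j, bond order)
            prev, order = idx, 1
        elif ch == "=":
            order = 2                             # next bond is double
        elif ch == "#":
            order = 3                             # next bond is triple
        elif ch == "(":
            stack.append(prev)                    # open a branch
        elif ch == ")":
            prev = stack.pop()                    # return to branch parent
    return atoms, bonds

# Glycylglycine (Gly-Gly dipeptide): NCC(=O)NCC(=O)O
atoms, bonds = smiles_to_graph("NCC(=O)NCC(=O)O")
```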


Table 4 . Node features of peptide graph.

2.2.2 Graph convolutional neural network module

Let A be the adjacency matrix and X the feature matrix of the given graph. Each GCN layer takes A and the current node embeddings as input and outputs updated embeddings, following the propagation rule of Eqs 1, 2:

H^(l+1) = ReLU(D̂^(−1/2) Â D̂^(−1/2) H^(l) W^(l+1)), with Â = A + I,

where H^(l) denotes the node embeddings at layer l, H^(0) = X, W^(l+1) is a trainable weight matrix, and D̂ is the diagonal node degree matrix of Â.
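The propagation rule for a single GCN layer can be rehearsed in a few lines of plain Python. This is a toy sketch of the standard rule, not the training code; the toy graph and identity weights are made up for illustration.

```python
import math

def matmul(X, Y):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(A)
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in A_hat]
    norm = [[d_inv_sqrt[i] * A_hat[i][j] * d_inv_sqrt[j] for j in range(n)]
            for i in range(n)]
    Z = matmul(matmul(norm, H), W)
    return [[max(0.0, v) for v in row] for row in Z]  # ReLU

# Toy graph: 3 nodes in a path, 2-dim features, identity weight matrix.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
H1 = gcn_layer(A, H, W)
```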

After obtaining the vector representations of the HLA and the peptide, they are concatenated and fed into a Multi-Layer Perceptron (MLP) to predict the binding affinity score. The MLP consists of three linear transformation layers, each followed by a Rectified Linear Unit (ReLU) activation function and a dropout layer with a dropout rate of 0.1, as in ( Öztürk et al., 2019 ). The Mean Squared Error (MSE), defined in Eq. 3 , is employed as the loss function to measure the discrepancy between predicted and actual affinity scores.

Here, n is the sample size, and P_i and Y_i are the predicted and true values of the ith interaction pair, respectively, so that MSE = (1/n) Σ_i (P_i − Y_i)².
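This loss is a direct transcription of the standard MSE formula:

```python
def mse(predicted, actual):
    """Mean squared error between predicted and true affinity scores (Eq. 3)."""
    n = len(predicted)
    return sum((p - y) ** 2 for p, y in zip(predicted, actual)) / n

print(round(mse([0.5, 0.7], [0.4, 0.9]), 3))  # 0.025
```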

2.2.3 Gradient-weighted activation mapping

While Grad-CAM has been successfully applied to various computer vision tasks, it is not directly applicable to graph-structured data. Therefore, in this paper we propose a novel result-interpretation method called Grad-WAM, which can be used to identify key binding-related residues. Grad-WAM measures the contribution of each residue to the binding decision by making use of the gradient information in the last GCN layer. It utilizes a weighted combination of the positive partial derivatives of the feature maps with respect to the interaction values to generate the corresponding visual explanations. Because the contribution of each residue is not equal, and differently from the explanation method proposed in MGraphDTA ( Yang et al., 2022 ), we introduce an additional weight ω (Eq. 4 ) on the gradient values.

Here, ReLU is the activation function, P is the predicted value as in Eq. 5 , T_i is the feature value of the ith node on the feature map T of the last GCN layer, α_i is the gradient value of the ith node defined in Eq. 6 , and ∂P/∂T_i is the partial derivative as in Eq. 7 .

In this way, the contribution of each residue to the predicted binding affinity is calculated. For visual explanation, residues are displayed using colors ranging from blue to red: a higher gradient value corresponds to a redder color, indicating a key role for that amino acid in the interaction.
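Since Eqs 4–7 are not reproduced here, the following is only a schematic sketch of a Grad-WAM-style score: each node's feature value is weighted by its gradient, negative contributions are removed with ReLU, and scores are normalized to [0, 1] for the blue-to-red coloring. The exact published equations may differ.

```python
def grad_wam_scores(T, alpha, omega=1.0):
    """Schematic Grad-WAM-style node scores (illustrative only).
    T[i]     : feature value of node i in the last GCN layer's feature map.
    alpha[i] : gradient dP/dT_i of the prediction w.r.t. that feature.
    omega    : extra weight on the gradient values (Eq. 4 in the paper)."""
    raw = [max(0.0, omega * a * t) for a, t in zip(alpha, T)]  # ReLU
    top = max(raw) or 1.0                 # guard against all-zero scores
    return [r / top for r in raw]         # normalize for coloring

# Toy values: node 1 dominates; node 2 is zeroed by its negative gradient.
scores = grad_wam_scores(T=[0.2, 0.9, 0.5], alpha=[0.5, 1.0, -0.3])
```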

3.1 Performance comparisons with other methods

Four widely used performance metrics were employed to measure the methods’ performance: accuracy ( Acc ), Matthews correlation coefficient ( MCC ), sensitivity ( Sn ), and specificity ( Sp ), defined in Eqs 8 – 11 .

Here, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively; predictions were assessed as true or false by comparing predicted and true values. Receiver operating characteristic (ROC) curves were generated for all methods, and each algorithm’s ability to discriminate binders from non-binders was analyzed by calculating the area under the ROC curve (AUC) as an estimate of prediction performance.
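The four metrics follow directly from the confusion-matrix counts; the function below is a direct transcription of the standard definitions:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Acc, MCC, Sn, Sp from confusion-matrix counts (Eqs 8-11)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)                       # sensitivity (recall on binders)
    sp = tn / (tn + fp)                       # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "MCC": mcc, "Sn": sn, "Sp": sp}

m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```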

We compared GIHP with state-of-the-art allele-specific and pan-specific baselines, including NetMHC-4.0 ( Andreatta and Nielsen, 2016 ), NetMHCpan-4.0 ( Jurtz et al., 2017 ), PickPocket-1.1 ( Zhang et al., 2009 ), SMMPMBEC ( Kim et al., 2009 ), MHCflurry ( O'Donnell et al., 2018 ), MixMHCpred-2.0 ( Bassani-Sternberg et al., 2017 ), and NetMHCcons-1.1 ( Karosiene et al., 2012 ). To eliminate the impact of data variations, all models were retrained and tested on our newly collected and processed dataset using 10-fold cross-validation (CV): the dataset is divided into 10 folds, and in each iteration one fold is designated as the validation set while the remaining nine are used to train the model. The final performance is the average across the 10 iterations. As shown in Figure 2 , GIHP outperforms all the compared prediction methods on average. It is worth noting that not every method supports every HLA and peptide length; to make the comparison fairer and more reasonable, we trained each allele-specific model only on the HLAs and peptide lengths it requires, as included in our datasets.
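The 10-fold protocol can be sketched as follows; this is an illustrative split routine, not the authors' evaluation pipeline.

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation:
    each fold serves once as the validation set while the other
    k-1 folds are used for training."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # fixed seed for reproducibility
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_splits(100, k=10))
```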


Figure 2 . Performance comparison results.

To make the comparison fairer and to test the methods on other protein-peptide binding data, a separate independent test was conducted using the data collected from pepBDB, which have no overlap with the training data above. This independent test set serves as an unbiased validation source to assess the performance of the different tools and to probe the models’ generalization ability. 10-fold cross-validation was applied, and average results were calculated after each epoch. Results on the pepBDB independent test data are shown in Figure 3 .


Figure 3 . Independent test results on pepBDB datasets.

On this independent test data, GIHP achieved the highest AUC (0.88) and the highest Sp score (0.98). In contrast, NetMHCpan-4.0 and PickPocket-1.1 attained AUC values of 0.76 or lower and Acc scores of 0.71 or lower on this new dataset. Differently from the results above, MHCflurry reached an AUC of up to 0.80; similar to our method, MHCflurry harnesses deep learning and a comprehensive dataset to improve the prediction of HLA-peptide binding affinities. Our model outperforms both allele-specific and pan-specific methods, demonstrating higher prediction accuracy and robust generalization across all kinds of training data.

To evaluate the performance of our method for different peptide lengths, we collected an independent test set and an external test set from TransPHLA, which can be downloaded from https://github.com/a96123155/TransPHLA-AOMP/tree/master/Dataset . In these datasets, 9-mer peptides comprise the largest proportion, while the numbers of 13-mer and 14-mer peptides are very small. Our model’s performance on the independent and external test sets for different peptide lengths is shown in Figures 4A, B , respectively. As shown in Figure 4 , our method achieves good performance for all peptide lengths.


Figure 4 . The performance of our model on the independent test set and external test set for the different peptide lengths. (A) Performance on the independent test set. (B) Performance on the external test set.

3.2 Key binding residues on HLAs

The binding of peptides to HLA molecules occurs within specialized regions called binding pockets. HLA class I molecules have a peptide-binding groove formed by two alpha helices (α1 and α2) and a beta-sheet platform. Within this groove, six pockets (labeled A to F, shown in Figure 5A ) interact with specific amino acid residues of the bound peptide. HLA class II molecules are involved in presenting peptides derived from extracellular proteins to helper T cells. Their binding pockets are formed by two chains, the alpha chain (α) and the beta chain (β), each consisting of two domains: the α1 and β1 domains form the peptide-binding groove, while the α2 and β2 domains provide structural support. The binding groove of HLA class II molecules is open at both ends, allowing longer peptides to bind than in HLA class I molecules. The binding pockets in HLA class II molecules are referred to as P1, P4, P6, P7, and P9 ( Figure 5B ). With the GIHP result-interpretation module, many key binding residues on both HLA classes and the corresponding peptides were identified. Although some residues with high activity scores are located outside the binding pockets, most lie within one of them. As shown in Figures 5C,D , 45 residues with the highest activity scores on HLAs were identified; 26 of them are located in HLA class I pockets and 19 in HLA class II pockets.


Figure 5 . The key binding residues on HLA pockets and HLA-binding peptide motifs. (A) Binding pockets on HLA class I molecules. (B) Binding pockets on HLA class II molecules. (C) The identified key binding residue locations and activity scores for each pocket of HLA class I molecules, where R denotes the residue location on the analyzed HLA molecules. (D) The identified key binding residue locations and activity scores for each pocket of HLA class II molecules. (E) Distribution of preferred peptide residues of HLA class I molecules, generated with Seq2Logo-2.0. (F) Distribution of preferred peptide residues of HLA class II molecules, generated with Seq2Logo-2.0.

Position 159 has the highest activity score in pocket A; other identified positions include 59, 171, 167, 7, and 66. According to current research, position 7 is a floor residue of pocket A: it creates a hydrophobic environment within the pocket and interacts with the side chain of the anchor residue. Although there is no evidence that the residue at position 159 is directly involved in peptide binding, it has structural and functional implications for the overall stability and conformation of the pocket A region ( Ma et al., 2020 ). It potentially contributes to the shape and electrostatic properties of the pocket, indirectly affecting the binding preferences and stability of the peptides presented by the HLA class I molecule. However, the specific role and impact of residue 159 on pocket A’s function vary among HLA alleles and need further study for a comprehensive understanding. In pocket B, substitutions at position 70 were found to yield a significantly distinct peptide-binding repertoire in HLA-A molecules compared with HLA-B molecules. Positions 167 and 67 in pocket B have been demonstrated to be key peptide-binding residues; in addition, substitutions at positions 67 and 9 exert a significant influence on the peptide-binding repertoire ( van Deutekom and Keşmir, 2015 ). Position 97 has the highest activity score in pocket C and is known to be a critical residue for peptide binding and presentation. It is located near the C-terminal anchor residue of the bound peptide and contributes to the formation of the peptide-binding groove, so the amino acid at position 97 can significantly influence the peptide-binding specificity and affinity of the HLA molecule. Substitutions or variations at this position can alter the size, shape, or electrostatic properties of pocket C, thereby affecting the recognition and binding of specific peptides. Several studies have investigated the impact of position 97 on peptide binding and immunological responses ( Moutaftsi et al., 2006 ).

Among the residues with high activity scores on HLA class II pockets, position 9 is crucial for determining the peptide-binding specificity of the HLA class II molecule: the amino acid at position 9 of the bound peptide interacts with residues in the P1 pocket, influencing the peptide-binding preferences. Position 86 plays a critical role in peptide binding and presentation ( Brown et al., 1993 ); the amino acid at this position interacts with the peptide residue and contributes to the stability and specificity of the HLA class II-peptide complex ( Stern et al., 1994 ). Among the important positions we identified, positions 13 and 74 are critical for determining the peptide-binding specificity and stability of HLA class II molecules, and the interactions between peptide residues and the residues in these pockets are essential for the recognition and presentation of antigenic peptides to CD4 + T cells. Beyond these positions, we also prioritized many other residues, such as positions 63 and 57. Characterizing these positions within the peptide-binding grooves of HLA class II molecules is crucial for understanding the molecular basis of antigen presentation and immune responses, gives researchers valuable information about the molecular interactions governing antigen presentation and T cell recognition, and can help in designing personalized immunotherapies ( Boukouaci et al., 2024 ).

Figures 5E, F show the motif analysis results. In both figures, the Y-axis gives the information content in bits and the X-axis the position in the alignment. At each position there is a stack of symbols representing amino acids: large symbols correspond to frequently observed amino acids, tall stacks to conserved positions, and short stacks to variable positions. Positions 2, 4, and 9 therefore carry the most frequently observed amino acids in HLA class I and class II, respectively.

3.3 Key binding residues on peptides and their corresponding genes

In this paper, we focus on finding immunotherapy-efficacy-related key residues and their corresponding genes. Using the identified residue positions and the corresponding gene mutations, we verify whether they can serve as biomarkers that separate patients into different survival groups. We applied GIHP to the immunotherapy-related datasets (Samstein-2019 and Miao-2018 in Table 2 ). For each SNP mutation site, we extract the 9-mer peptide around it and predict its binding affinity against all 223 HLAs. By statistically comparing the binding affinity before and after the residue substitution (paired t -test), together with the GIHP activity score of each residue, significant key binding residues were identified. To determine the functions of the genes carrying these mutations, we conducted GO enrichment analysis with ShinyGO-0.80 ( Ge et al., 2020 ). As shown in Figure 6 , most key residues lie in genes related to cancer pathways and cancer-related signaling pathways.
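The peptide-extraction step can be sketched as follows. Enumerating every 9-mer that covers the mutated site is one common convention for neoantigen candidates and is an illustrative choice here; the example sequence and position are made up.

```python
def extract_9mers(protein_seq, mut_pos, alt_aa):
    """Apply a SNP (0-based position mut_pos -> alt_aa) and return all
    9-mer peptides from the mutated sequence that contain the mutated
    residue. Illustrative sketch, not the authors' extraction code."""
    mutated = protein_seq[:mut_pos] + alt_aa + protein_seq[mut_pos + 1:]
    peptides = []
    # A 9-mer starting at s covers positions s..s+8, so it contains
    # mut_pos exactly when mut_pos-8 <= s <= mut_pos (within bounds).
    for start in range(max(0, mut_pos - 8), min(mut_pos, len(mutated) - 9) + 1):
        peptides.append(mutated[start:start + 9])
    return peptides

# Hypothetical protein fragment with a Q -> L substitution at position 10.
peps = extract_9mers("MKTAYIAKQRQISFVKSHFSRQ", mut_pos=10, alt_aa="L")
```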


Figure 6 . GO enrichment results of key residues related genes.

Since we are interested in finding mutations related to immunotherapy response, we further analyzed the key residues enriched in the T cell receptor signaling pathway ( Figure 6 ). The enriched genes include RHOA, HLA-B, HRAS, IL10, NRAS, and KRAS. RHOA has been implicated in T cell activation and migration, which are critical for effective anti-tumor immune responses ( Bros et al., 2019 ); altered RHOA signaling could affect T cell function and infiltration into the tumor microenvironment, influencing immunotherapy response. HLA-B plays a crucial role in immune recognition, as it presents peptide antigens derived from intracellular proteins to cytotoxic T cells. HRAS, NRAS, and KRAS belong to the RAS family of oncogenes and encode proteins involved in intracellular signaling pathways regulating cell growth, survival, and proliferation; RAS mutations have been associated with poorer response rates to certain immunotherapies, including immune checkpoint inhibitors ( East et al., 2022 ). IL10 can suppress the activity of cytotoxic T cells and natural killer (NK) cells, which are critical for tumor surveillance and elimination, and high levels of IL10 in the tumor microenvironment have been associated with immunosuppression and reduced response to immunotherapy ( Salkeni and Naing, 2023 ).

Next, we investigated the impact of biomarker gene mutations on patient survival outcomes using a cohort of immunotherapy-treated individuals (Samstein-2019 dataset in Table 2 ). The patients were categorized into two groups based on the presence or absence of a biomarker gene mutation. Kaplan-Meier survival curves were generated, and a log-rank test was performed to compare survival between the two groups. The results revealed a significant difference in survival, with patients harboring a biomarker gene mutation exhibiting a higher risk of adverse events than those without. These findings highlight the potential prognostic significance of the biomarker gene mutations and underscore their relevance for patient stratification and personalized treatment approaches. Furthermore, we compared our results with the TMB score provided in Samstein et al. (2019) . As shown in Figure 7 , patients with biomarker mutations tend to have poorer survival.
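The Kaplan-Meier estimate behind such curves is standard: S(t) is the product, over event times t_i ≤ t, of (1 − d_i/n_i), where d_i events occur among n_i patients still at risk. Below is a compact, self-contained version for illustration; a real analysis would normally use an established survival-analysis package, and the toy cohort is made up.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate. events[i] is 1 for an observed
    event (e.g., death/progression) and 0 for a censored observation.
    Returns [(event_time, S(t)), ...]."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, curve = 1.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = n_t = 0
        while i < len(order) and times[order[i]] == t:  # group ties at time t
            d += events[order[i]]
            n_t += 1
            i += 1
        if d:                                  # survival drops only at events
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= n_t                         # events and censorings leave risk set
    return curve

# Toy cohort: events at t=2 and t=5, censored observations at t=3 and t=7.
curve = kaplan_meier([2, 3, 5, 7], [1, 0, 1, 0])
```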


Figure 7 . Results on immunotherapy data. (A) patient groups separated by GIHP identified biomarker mutations. (B) TMB separated patient groups.

As shown in Figure 7 , our method separates patients more significantly. Although TMB can also separate patients, it is an overall measure, and it is hard to know which gene mutations play key roles in differentiating patients’ responses. Our method not only separates patients significantly but also identifies which residue substitutions play key roles. To further test the performance of these biomarker genes, we analyzed the Miao-2018 dataset ( Table 2 ); the results are shown in Figure 8 .


Figure 8 . Results on Miao-2018 datasets.

As illustrated in Figure 8 , the identified biomarker mutations are also able to effectively separate patient groups with statistical significance. Our findings provide compelling evidence that the identified biomarker genes may possess valuable predictive power for immunotherapy response and patient survival outcomes. This highlights their potential as clinically relevant targets for the development of personalized treatment approaches. The results of this study advance the understanding of the underlying molecular mechanisms governing immunotherapy efficacy, and offer promising directions for future research and therapeutic interventions.

3.4 Performance on other cancer datasets

In this section, we test whether these key residue mutations and their corresponding genes can separate patients in other cancer datasets. The results are shown in Figures 9A–C , and detailed information on the three cancer datasets is given in Table 2 . Our biomarker genes differentiate patients significantly in all three datasets, especially in the pan-cancer dataset.


Figure 9 . Survival curves on breast, bladder and pan cancer datasets.

4 Conclusion

In summary, we propose a new GCNN-based framework called GIHP for pan-specific HLA-peptide binding affinity prediction. GIHP harnesses both structure and sequence information and utilizes Grad-WAM for visual interpretation. Extensive comparison with state-of-the-art methods verified the superior performance of our method. Collectively, the findings provide evidence that the GIHP framework improves the generalization and interpretability of HLA-peptide binding prediction models. Furthermore, we identified numerous key binding-related amino acid residues that can serve as potential biomarkers for differentiating patient groups by immunotherapy response. When applied to datasets from other cancer types, these biomarkers also effectively differentiated patient groups with statistical significance. These findings highlight the potential prognostic significance of the biomarker gene mutations and underscore their relevance for patient stratification and personalized immunotherapy treatment approaches.

Data availability statement

The data presented in the study are deposited on GitHub: https://github.com/sdustSu/GIHP .

Author contributions

LS: Funding acquisition, Methodology, Writing–original draft, Writing–review and editing. YY: Formal Analysis, Methodology, Validation, Visualization, Writing–review and editing. BM: Data curation, Formal Analysis, Investigation, Writing–review and editing. SZ: Formal Analysis, Methodology, Resources, Visualization, Writing–review and editing. ZC: Conceptualization, Project administration, Resources, Supervision, Writing–original draft, Writing–review and editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work is supported by the Natural Science Foundation of Shandong Province (Youth Program, Grant No. ZR2022QF136), the Elite Program of Shandong University of Science and Technology, and the National Science Foundation of China (Grant No. 62302277).

Conflict of interest

Author YY was employed by Shandong Guohe Industrial Technology Research Institute Co. Ltd. BM was employed by Qingdao UNIC Information Technology Co. Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Andreatta, M., and Nielsen, M. (2016). Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics 32, 511–517. doi:10.1093/bioinformatics/btv639


Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., et al. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876. doi:10.1126/science.abj8754

Bassani-Sternberg, M., Chong, C., Guillaume, P., Solleder, M., Pak, H., Gannon, P. O., et al. (2017). Deciphering HLA-I motifs across HLA peptidomes improves neo-antigen predictions and identifies allostery regulating HLA specificity. PLoS Comput. Biol. 13, e1005725. doi:10.1371/journal.pcbi.1005725

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., et al. (2000). The protein data bank. Nucleic Acids Res. 28, 235–242. doi:10.1093/nar/28.1.235

Boukouaci, W., Rivera-Franco, M. M., Volt, F., Lajnef, M., Wu, C. L., Rafii, H., et al. (2024). HLA peptide-binding pocket diversity modulates immunological complications after cord blood transplant in acute leukaemia. Br. J. Haematol. 204, 1920–1934. doi:10.1111/bjh.19339

Bros, M., Haas, K., Moll, L., and Grabbe, S. (2019). RhoA as a key regulator of innate and adaptive immunity. Cells 8, 733. doi:10.3390/cells8070733

Brown, J. H., Jardetzky, T. S., Gorga, J. C., Stern, L. J., Urban, R. G., Strominger, J. L., et al. (1993). Three-dimensional structure of the human class II histocompatibility antigen HLA-DR1. Nature 364, 33–39. doi:10.1038/364033a0

Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., et al. (2012). The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2, 401–404. doi:10.1158/2159-8290.CD-12-0095

Cheng, J., Bendjama, K., Rittner, K., and Malone, B. (2021). BERTMHC: improved MHC-peptide class II interaction prediction with transformer and multiple instance learning. Bioinformatics 37, 4172–4179. doi:10.1093/bioinformatics/btab422

Clinton, T. N., Chen, Z., Wise, H., Lenis, A. T., Chavan, S., Donoghue, M. T. A., et al. (2022). Genomic heterogeneity as a barrier to precision oncology in urothelial cancer. Cell Rep. 41, 111859. doi:10.1016/j.celrep.2022.111859

Consortium, I. T. P., Abascal, F., Abeshouse, A., Aburatani, H., Adams, D. J., Agrawal, N., et al. (2020). Pan-cancer analysis of whole genomes. Nature 578, 82–93. doi:10.1038/s41586-020-1969-6

East, P., Kelly, G. P., Biswas, D., Marani, M., Hancock, D. C., Creasy, T., et al. (2022). RAS oncogenic activity predicts response to chemotherapy and outcome in lung adenocarcinoma. Nat. Commun. 13, 5632. doi:10.1038/s41467-022-33290-0

Ge, S. X., Jung, D., and Yao, R. (2020). ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics 36, 2628–2629. doi:10.1093/bioinformatics/btz931

Gfeller, D., Guillaume, P., Michaux, J., Pak, H. S., Daniel, R. T., Racle, J., et al. (2018). The length distribution and multiple specificity of naturally presented HLA-I ligands. J. Immunol. 201, 3705–3716. doi:10.4049/jimmunol.1800914

Giguere, S., Drouin, A., Lacoste, A., Marchand, M., Corbeil, J., and Laviolette, F. (2013). MHC-NP: predicting peptides naturally processed by the MHC. J. Immunol. Methods 400-401, 30–36. doi:10.1016/j.jim.2013.10.003

Gizinski, S., Preibisch, G., Kucharski, P., Tyrolski, M., Rembalski, M., Grzegorczyk, P., et al. (2024). Enhancing antigenic peptide discovery: improved MHC-I binding prediction and methodology. Methods 224, 1–9. doi:10.1016/j.ymeth.2024.01.016

Jensen, K. K., Andreatta, M., Marcatili, P., Buus, S., Greenbaum, J. A., Yan, Z., et al. (2018). Improved methods for predicting peptide binding affinity to MHC class II molecules. Immunology 154, 394–406. doi:10.1111/imm.12889

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. doi:10.1038/s41586-021-03819-2

Jurtz, V., Paul, S., Andreatta, M., Marcatili, P., Peters, B., and Nielsen, M. (2017). NetMHCpan-4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. J. Immunol. 199, 3360–3368. doi:10.4049/jimmunol.1700893

Kallingal, A., Olszewski, M., Maciejewska, N., Brankiewicz, W., and Baginski, M. (2023). Cancer immune escape: the role of antigen presentation machinery. J. Cancer Res. Clin. Oncol. 149, 8131–8141. doi:10.1007/s00432-023-04737-8

Karosiene, E., Lundegaard, C., Lund, O., and Nielsen, M. (2012). NetMHCcons: a consensus method for the major histocompatibility complex class I predictions. Immunogenetics 64, 177–186. doi:10.1007/s00251-011-0579-8

Karosiene, E., Rasmussen, M., Blicher, T., Lund, O., Buus, S., and Nielsen, M. (2013). NetMHCIIpan-3.0, a common pan-specific MHC class II prediction method including all three human MHC class II isotypes, HLA-DR, HLA-DP and HLA-DQ. Immunogenetics 65, 711–724. doi:10.1007/s00251-013-0720-y

Kim, K., Kim, H. S., Kim, J. Y., Jung, H., Sun, J. M., Ahn, J. S., et al. (2020). Predicting clinical benefit of immunotherapy by antigenic or functional mutations affecting tumour immunogenicity. Nat. Commun. 11, 951. doi:10.1038/s41467-020-14562-z

Kim, Y., Sidney, J., Buus, S., Sette, A., Nielsen, M., and Peters, B. (2014). Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions. BMC Bioinforma. 15, 241. doi:10.1186/1471-2105-15-241

Kim, Y., Sidney, J., Pinilla, C., Sette, A., and Peters, B. (2009). Derivation of an amino acid similarity matrix for peptide: MHC binding and its application as a Bayesian prior. BMC Bioinforma. 10, 394. doi:10.1186/1471-2105-10-394

Lundegaard, C., Lamberth, K., Harndahl, M., Buus, S., Lund, O., and Nielsen, M. (2008). NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8-11. Nucleic Acids Res. 36, W509–W512. doi:10.1093/nar/gkn202

Ma, L., Zhang, N., Qu, Z., Liang, R., Zhang, L., Zhang, B., et al. (2020). A glimpse of the peptide profile presentation by Xenopus laevis MHC class I: crystal structure of pXela-UAA reveals a distinct peptide-binding groove. J. Immunol. 204, 147–158. doi:10.4049/jimmunol.1900865

Meng, Z., Chen, C., Zhang, X., Zhao, W., and Cui, X. (2024). Exploring fragment adding strategies to enhance molecule pretraining in AI-driven drug discovery. Big Data Min. Anal. , 1–12. doi:10.26599/bdma.2024.9020003

Miao, D., Margolis, C. A., Vokes, N. I., Liu, D., Taylor-Weiner, A., Wankowicz, S. M., et al. (2018). Genomic correlates of response to immune checkpoint blockade in microsatellite-stable solid tumors. Nat. Genet. 50, 1271–1281. doi:10.1038/s41588-018-0200-2

Moutaftsi, M., Peters, B., Pasquetto, V., Tscharke, D. C., Sidney, J., Bui, H. H., et al. (2006). A consensus epitope prediction approach identifies the breadth of murine T(CD8+)-cell responses to vaccinia virus. Nat. Biotechnol. 24, 817–819. doi:10.1038/nbt1215

Murata, K., Ly, D., Saijo, H., Matsunaga, Y., Sugata, K., Ihara, F., et al. (2022). Modification of the HLA-A*24:02 peptide binding pocket enhances cognate peptide-binding capacity and antigen-specific T cell activation. J. Immunol. 209, 1481–1491. doi:10.4049/jimmunol.2200305

Nielsen, M., Lundegaard, C., and Lund, O. (2007). Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method. BMC Bioinforma. 8, 238. doi:10.1186/1471-2105-8-238

Nielsen, M., Lundegaard, C., Worning, P., Lauemoller, S. L., Lamberth, K., Buus, S., et al. (2003). Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci. 12, 1007–1017. doi:10.1110/ps.0239403

O’Donnell, T. J., Rubinsteyn, A., Bonsack, M., Riemer, A. B., Laserson, U., and Hammerbacher, J. (2018). MHCflurry: open-source class I MHC binding affinity prediction. Cell Syst. 7, 129–132. doi:10.1016/j.cels.2018.05.014

O’Donnell, T. J., Rubinsteyn, A., and Laserson, U. (2020). MHCflurry 2.0: improved pan-allele prediction of MHC class I-presented peptides by incorporating antigen processing. Cell Syst. 11, 418–419. doi:10.1016/j.cels.2020.09.001

Öztürk, H., Ozkirimli, E., and Özgür, A. (2019). WideDTA: prediction of drug-target binding affinity. arXiv:1902.04166.

Quiros, M., Grazulis, S., Girdzijauskaite, S., Merkys, A., and Vaitkus, A. (2018). Using SMILES strings for the description of chemical connectivity in the Crystallography Open Database. J. Cheminform 10, 23. doi:10.1186/s13321-018-0279-6

Razavi, P., Chang, M. T., Xu, G., Bandlamudi, C., Ross, D. S., Vasan, N., et al. (2018). The genomic landscape of endocrine-resistant advanced breast cancers. Cancer Cell 34, 427–438. doi:10.1016/j.ccell.2018.08.008

Reynisson, B., Alvarez, B., Paul, S., Peters, B., and Nielsen, M. (2020). NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 48, W449–W454. doi:10.1093/nar/gkaa379

Salkeni, M. A., and Naing, A. (2023). Interleukin-10 in cancer immunotherapy: from bench to bedside. Trends Cancer 9, 716–725. doi:10.1016/j.trecan.2023.05.003

Samstein, R. M., Lee, C. H., Shoushtari, A. N., Hellmann, M. D., Shen, R., Janjigian, Y. Y., et al. (2019). Tumor mutational load predicts survival after immunotherapy across multiple cancer types. Nat. Genet. 51, 202–206. doi:10.1038/s41588-018-0312-8

Seidel, R. D., Merazga, Z., Thapa, D. R., Soriano, J., Spaulding, E., Vakkasoglu, A. S., et al. (2021). Peptide-HLA-based immunotherapeutics platforms for direct modulation of antigen-specific T cells. Sci. Rep. 11, 19220. doi:10.1038/s41598-021-98716-z

Stern, L. J., Brown, J. H., Jardetzky, T. S., Gorga, J. C., Urban, R. G., Strominger, J. L., et al. (1994). Crystal structure of the human class II MHC protein HLA-DR1 complexed with an influenza virus peptide. Nature 368, 215–221. doi:10.1038/368215a0

van Deutekom, H. W. M., and Keşmir, C. (2015). Zooming into the binding groove of HLA molecules: which positions and which substitutions change peptide binding most? Immunogenetics 67, 425–436. doi:10.1007/s00251-015-0849-y

Varadi, M., Anyango, S., Deshpande, M., Nair, S., Natassia, C., Yordanova, G., et al. (2022). AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444. doi:10.1093/nar/gkab1061

Venkatesh, G., Grover, A., Srinivasaraghavan, G., and Rao, S. (2020). MHCAttnNet: predicting MHC-peptide bindings for MHC alleles classes I and II using an attention-based deep neural model. Bioinformatics 36, i399–i406. doi:10.1093/bioinformatics/btaa479

Vita, R., Mahajan, S., Overton, J. A., Dhanda, S. K., Martini, S., Cantrell, J. R., et al. (2019). The immune epitope database (IEDB): 2018 update. Nucleic Acids Res. 47, D339–D343. doi:10.1093/nar/gky1006

Wang, M., and Claesson, M. H. (2014). Classification of human leukocyte antigen (HLA) supertypes. Methods Mol. Biol. 1184, 309–317. doi:10.1007/978-1-4939-1115-8_17

Wang, P., Sidney, J., Dow, C., Mothe, B., Sette, A., and Peters, B. (2008). A systematic assessment of MHC class II peptide binding predictions and evaluation of a consensus approach. PLoS Comput. Biol. 4, e1000048. doi:10.1371/journal.pcbi.1000048

Wang, P., Sidney, J., Kim, Y., Sette, A., Lund, O., Nielsen, M., et al. (2010). Peptide binding predictions for HLA DR, DP and DQ molecules. BMC Bioinforma. 11, 568. doi:10.1186/1471-2105-11-568

Wang, X., Wu, T., Jiang, Y., Chen, T., Pan, D., Jin, Z., et al. (2024). RPEMHC: improved prediction of MHC-peptide binding affinity by a deep learning approach based on residue-residue pair encoding. Bioinformatics 40, btad785. doi:10.1093/bioinformatics/btad785

Wang, Y., Jiao, Q., Wang, J., Cai, X., Zhao, W., and Cui, X. (2023). Prediction of protein-ligand binding affinity with deep learning. Comput. Struct. Biotechnol. J. 21, 5796–5806. doi:10.1016/j.csbj.2023.11.009

Wen, Z., He, J., Tao, H., and Huang, S. Y. (2019). PepBDB: a comprehensive structural database of biological peptide-protein interactions. Bioinformatics 35, 175–177. doi:10.1093/bioinformatics/bty579

Yang, Z., Zhong, W., Zhao, L., and Yu-Chian Chen, C. (2022). MGraphDTA: deep multiscale graph neural network for explainable drug-target binding affinity prediction. Chem. Sci. 13, 816–833. doi:10.1039/d1sc05180f

You, R., Qu, W., Mamitsuka, H., and Zhu, S. (2022). DeepMHCII: a novel binding core-aware deep interaction model for accurate MHC-II peptide binding affinity prediction. Bioinformatics 38, i220–i228. doi:10.1093/bioinformatics/btac225

Zhang, H., Lund, O., and Nielsen, M. (2009). The PickPocket method for predicting binding specificities for receptors based on receptor pocket similarities: application to MHC-peptide binding. Bioinformatics 25, 1293–1299. doi:10.1093/bioinformatics/btp137

Zhao, W., and Sher, X. (2018). Systematically benchmarking peptide-MHC binding predictors: from synthetic to naturally processed epitopes. PLoS Comput. Biol. 14, e1006457. doi:10.1371/journal.pcbi.1006457

Keywords: HLA-peptide binding, model interpretation, GCNN, immunotherapy, affinity prediction

Citation: Su L, Yan Y, Ma B, Zhao S and Cui Z (2024) GIHP: Graph convolutional neural network based interpretable pan-specific HLA-peptide binding affinity prediction. Front. Genet. 15:1405032. doi: 10.3389/fgene.2024.1405032

Received: 22 March 2024; Accepted: 20 June 2024; Published: 10 July 2024.

Copyright © 2024 Su, Yan, Ma, Zhao and Cui. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Zhenyu Cui, [email protected]

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
