One of Python's strongest features is the wide range of easy-to-use graph tools it offers for presenting data visually. The traditional matplotlib package is usually the first one a beginner Python programmer learns. Matplotlib is also used by the pandas package, the de facto standard tool in data science. Another package, seaborn, takes the plotting capabilities of matplotlib to the next level. The resulting plots look particularly good when used in Python notebooks. There are alternative graph tools for Python as well, such as the Plotly, Bokeh and Altair packages.
Let’s start by downloading PyScripter – a great GUI for Python coding.
Why use Python graph tools?
Python is the programming language of choice for data science and machine learning. Data often comes from noisy sources and may contain missing or incorrect values. In other cases, it is necessary to get an initial overview of the data before processing it further. Data visualization is an essential first step in choosing appropriate methods of numerical analysis.
What kind of plots can one create using Python graph tools?
- Line plot
- Scatter plot
- Histogram
- Pie chart
- Heat map
- Contour plot
- 3D plot
- Animations
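As a rough illustration of three of the plot types listed above, the sketch below draws a histogram, a pie chart and a heat map side by side with matplotlib. The data is randomly generated purely for this example, and the variable names (samples, shares, grid) are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data, invented purely for illustration
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)
shares = [45, 30, 25]
grid = rng.random((10, 10))

# One row with three panels
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].hist(samples, bins=20)
axes[0].set_title("Histogram")

axes[1].pie(shares, labels=["A", "B", "C"])
axes[1].set_title("Pie chart")

image = axes[2].imshow(grid, cmap="viridis")
fig.colorbar(image, ax=axes[2])
axes[2].set_title("Heat map")

fig.tight_layout()
# Call plt.show() to display the figure in an interactive session.
```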
What is the most common Python graph tool?
Matplotlib is the traditional plotting package, and a great many other Python packages build on it.
Getting started
I shall start with the very basics. We need two lists of values, one for the x-axis and one for the y-axis of a simple line plot.
    import numpy as np
    import matplotlib.pyplot as plt

    X = [1,2,3,4,5]
    Y = [6,7,8,9,10]

    plt.plot(X,Y)
    plt.show()
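Note that plt.plot joins consecutive points with a line. If separate, unconnected markers are wanted instead, matplotlib's plt.scatter can be used with the same two lists. A minimal sketch:

```python
import matplotlib.pyplot as plt

X = [1, 2, 3, 4, 5]
Y = [6, 7, 8, 9, 10]

# scatter draws one unconnected marker per (x, y) pair
points = plt.scatter(X, Y)
plt.xlabel("x")
plt.ylabel("y")
# Call plt.show() to display the figure in an interactive session.
```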
Plotting functions using the matplotlib graph tool for Python and NumPy
Let’s create a simple data set using the NumPy package and plot simple mathematical functions. We need a set of equally spaced points for the x-axis. This can be done using the arange function from NumPy. On the y-axis, we will plot some lines and curves. By the way – that IS the correct spelling of arange – think of it as “a range” not “arrange”.
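A quick sketch of what arange produces may help. The stop value is not included in the result, and a third argument sets the step. The linspace call is not used elsewhere in this article; it is shown only as a common alternative when a fixed number of points is wanted.

```python
import numpy as np

X = np.arange(0, 100)            # 0, 1, ..., 99; the stop value is excluded
X_step = np.arange(0, 1, 0.25)   # 0.0, 0.25, 0.5, 0.75
X_lin = np.linspace(0, 1, 5)     # five points, both end points included

print(X[:3], X[-1], len(X))
print(X_step)
print(X_lin)
```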
I will define lambda functions. They are often considered an advanced Python concept, but they are actually very simple. A lambda function is a concise way of writing a function on one line. For example, I define the function multiply, which takes two numbers a and b and returns their product, a*b. Then I set the Y values that we want to plot by calling the multiply function with the array of points on the x-axis and a number as arguments.
I will define two lines, one for which every value of y is the same as x, and another one where y is twice as big as x.
    import numpy as np
    import matplotlib.pyplot as plt

    X = np.arange(0,100)

    multiply = lambda a,b: a*b
    Y1 = multiply(X,1)
    Y2 = multiply(X,2)

    plt.plot(X,Y1)
    plt.plot(X,Y2)
    plt.show()
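To make the lambda less mysterious, the sketch below puts it next to an equivalent function written with def. The name multiply_def is hypothetical and only used here. Because X is a NumPy array, the * operator works element by element, so a single call transforms every point at once.

```python
import numpy as np

X = np.arange(0, 5)

# The one-line lambda used in the article
multiply = lambda a, b: a * b

# An equivalent named function (hypothetical name, for comparison only)
def multiply_def(a, b):
    return a * b

# NumPy applies * element by element, so one call handles the whole array
print(multiply(X, 2))
print(multiply_def(X, 2))
print(multiply(X, X))
```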
We can plot curves, as well. A simple parabola is defined by y = x², so I can again use the multiply function to create the points.
    Y3 = multiply(X,X)
How to change colours and line styles
Matplotlib automatically assigns a different colour to each line, but sometimes it is useful to change them. This can be done by specifying the name of the colour with the color parameter. Matplotlib accepts the CSS colour names, so a full list is available online by searching for CSS colours.
    plt.plot(X,Y1,color='lightgreen')
    plt.plot(X,Y2,color='royalblue')
    plt.plot(X,Y3,color='mediumorchid')
Lines can be plotted in different styles such as dotted or dashed. This is set by the parameter linestyle.
    plt.plot(X,Y1,linestyle='dashed')
    plt.plot(X,Y2,linestyle='dotted')
    plt.plot(X,Y3,linestyle='dashdot')
There is a shorter way of defining line type and/or colour. For example, a red dashed line is defined by
    plt.plot(X,Y1,'r--')
A blue dotted line can be made using
    plt.plot(X,Y2,'b:')
And a green dot-dash line is simply
    plt.plot(X,Y3,'g-.')
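The short format strings are just another spelling of the keyword arguments shown earlier. The sketch below draws one line each way and inspects the resulting line objects to check that the colour and line style agree. The variable names short_line and long_line are made up for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

X = np.arange(0, 20)

fig, ax = plt.subplots()
# Short form: 'r' is red, '--' is dashed
(short_line,) = ax.plot(X, X, "r--")
# Long form: the same style spelled out with keyword arguments
(long_line,) = ax.plot(X, 2 * X, color="red", linestyle="dashed")

# Both lines report the same line style and the same RGBA colour
print(short_line.get_linestyle(), long_line.get_linestyle())
print(to_rgba(short_line.get_color()) == to_rgba(long_line.get_color()))
```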
The points can be individually drawn as markers. This is similar to the syntax above. There are many marker styles – points, squares, triangles and more.
    plt.plot(X,Y1,'r.')
    plt.plot(X,Y2,'bo')
    plt.plot(X,Y3,'g^')
Line thickness can be adjusted using the linewidth property.
    plt.plot(X,Y1,linewidth=1)
    plt.plot(X,Y2,linewidth=2)
    plt.plot(X,Y3,linewidth=3)
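The styling options above can be combined in a single call. The following sketch sets colour, line style, marker and line width together. The markersize parameter does not appear elsewhere in this article and is included only to show that marker size can be tuned in the same way.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.arange(0, 20)
Y = 2 * X

# Colour, line style, marker and widths set together in one call
(line,) = plt.plot(
    X, Y,
    color="royalblue",
    linestyle="dashed",
    marker="o",
    markersize=4,
    linewidth=2,
)
# Call plt.show() to display the figure in an interactive session.
```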
How to set labels?
    import numpy as np
    import matplotlib.pyplot as plt

    X = np.arange(0,20)

    multiply = lambda a,b: a*b
    Y1 = multiply(X,1)
    Y2 = multiply(X,2)
    Y3 = multiply(X,X)

    plt.plot(X,Y1,label='Y = X')
    plt.plot(X,Y2,label = 'Y = 2*X')
    plt.plot(X,Y3,label = 'Y = X$^2$')

    plt.legend()
    plt.title('Plotting simple functions')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()
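The same labels can also be set through matplotlib's object-oriented interface, which some people find easier to follow once a script contains more than one plot. This is an alternative sketch rather than a replacement for the code above. The loc argument of legend is an addition made here to pin the legend to one corner.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.arange(0, 20)

# fig is the whole figure, ax is the single plotting area inside it
fig, ax = plt.subplots()
ax.plot(X, X, label="Y = X")
ax.plot(X, X**2, label="Y = X$^2$")

ax.legend(loc="upper left")
ax.set_title("Plotting simple functions")
ax.set_xlabel("x")
ax.set_ylabel("y")
# Call plt.show() to display the figure in an interactive session.
```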
How to best plot large amounts of data?
Large amounts of data are usually processed using the pandas package. It utilises matplotlib to make some basic plots of the data, including line plots, histograms, and bar graphs. Advanced examples are available in the documentation of pandas.
I have downloaded a copy of the famous iris data set in CSV format. It contains measurements of four properties of three species of iris flowers: sepal length, sepal width, petal length and petal width.
Loading data with pandas
The code below loads the data about the iris flowers into a data frame. It automatically assigns the names of the columns as given in the CSV file. Pandas employs the plotting methods from matplotlib to easily create graphs. I will plot the sepal length and width as a histogram.
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv("iris.csv")

    print(data)

    data["sepal_length"].plot(kind = "hist", label = "sepal_length")
    data["sepal_width"].plot(kind = "hist", label = "sepal_width")

    plt.legend()
    plt.show()
The output looks like this:
         sepal_length  sepal_width  petal_length  petal_width    species
    0             5.1          3.5           1.4          0.2     setosa
    1             4.9          3.0           1.4          0.2     setosa
    2             4.7          3.2           1.3          0.2     setosa
    3             4.6          3.1           1.5          0.2     setosa
    4             5.0          3.6           1.4          0.2     setosa
    ..            ...          ...           ...          ...        ...
    145           6.7          3.0           5.2          2.3  virginica
    146           6.3          2.5           5.0          1.9  virginica
    147           6.5          3.0           5.2          2.0  virginica
    148           6.2          3.4           5.4          2.3  virginica
    149           5.9          3.0           5.1          1.8  virginica

    [150 rows x 5 columns]
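When two histograms share one set of axes, the second can hide the first. One possible remedy, sketched below, is to plot both columns from the data frame in a single call and make the bars semi-transparent with alpha. So that the sketch runs without iris.csv, it uses a small data frame of randomly generated numbers that only borrows the column names. The values are not real iris measurements.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the iris columns; the values are invented
rng = np.random.default_rng(seed=1)
data = pd.DataFrame({
    "sepal_length": rng.normal(5.8, 0.8, size=150),
    "sepal_width": rng.normal(3.0, 0.4, size=150),
})

# Summary statistics give a quick numerical overview
print(data.describe())

# Both columns in one call, sharing the same bins; alpha keeps both visible
ax = data.plot(kind="hist", bins=15, alpha=0.5)
ax.set_xlabel("value (synthetic)")
# Call matplotlib.pyplot.show() to display the figure interactively.
```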
Pandas is rich in features and it would take several more articles to go into all possible details. More information and examples can be found in the documentation of pandas.
How to plot multiple plots at once?
Matplotlib is the traditional Python graph tool, but there are plenty of parameters to set. Pandas significantly simplifies the process.
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv("iris.csv")

    data.plot(subplots=True)

    plt.legend()
    plt.show()
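For comparison, here is roughly what the same kind of figure takes when built with matplotlib alone: one row of axes per column, filled in a loop. This is only a sketch of the idea, not what pandas does internally. The data frame is randomly generated so that the example runs without iris.csv.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data; the values are invented
rng = np.random.default_rng(seed=3)
data = pd.DataFrame({
    "sepal_length": rng.normal(5.8, 0.8, size=50),
    "sepal_width": rng.normal(3.0, 0.4, size=50),
    "petal_length": rng.normal(3.8, 1.7, size=50),
})

# One subplot per column, stacked vertically and sharing the x-axis
fig, axes = plt.subplots(nrows=len(data.columns), ncols=1, sharex=True)
for ax, column in zip(axes, data.columns):
    ax.plot(data.index, data[column], label=column)
    ax.legend(loc="upper right")

axes[-1].set_xlabel("row index")
# Call plt.show() to display the figure in an interactive session.
```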
How to make plots even more pleasing?
The pandas graphs are quite okay and clearly show the distribution of the data, but there is room for improvement. The seaborn package makes the plots considerably more polished.
The data distribution can be visualised as a histogram using the following code.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    iris = pd.read_csv("iris.csv")

    sns.set_style("whitegrid")
    # histplot replaces the deprecated distplot; kde=True overlays a density curve
    sns.histplot(iris['sepal_length'], kde=True)
    plt.show()
A more detailed plot of the pairwise relationships between the columns is also very simple to do.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    iris = pd.read_csv("iris.csv")
    sns.pairplot(data = iris)
    plt.show()
What alternatives are there to matplotlib?
What is Plotly like?
Plotly is built on top of the plotly.js JavaScript library. It can make static graphs as well as interactive plots. It has an amazing set of features, including plotting data on geographical maps. When run as a script, the code below starts a local server and opens the view in the browser, where the user can rotate the globe and hover with the mouse for more details. This Python graph tool is very handy, and it is not necessary to know any JavaScript at all.
    import plotly.express as px

    df = px.data.gapminder().query("year==2007")
    fig = px.scatter_geo(df, locations="iso_alpha", color="continent",
                         hover_name="country", size="pop",
                         projection="orthographic")
    fig.show()
What are the benefits of Bokeh?
Bokeh is a professional tool for advanced users. It also produces interactive plots. There is a detailed tutorial on the Bokeh website, including Jupyter notebooks to try out. It is definitely worth trying.
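To give a feel for Bokeh, here is a minimal sketch that redraws two of the simple functions from earlier in this article. It is not taken from the Bokeh tutorial. The titles and labels are made up, and file_html is used here only so that the result can be produced as an HTML string without opening a browser.

```python
import numpy as np
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

X = np.arange(0, 20)

# A figure with a title and axis labels
p = figure(title="Plotting simple functions",
           x_axis_label="x", y_axis_label="y")

# A line and a set of markers, each with its own legend entry
p.line(X, 2 * X, legend_label="Y = 2*X", line_width=2)
p.scatter(X, X**2, legend_label="Y = X^2", size=6, color="green")

# Build a standalone, interactive HTML page as a string
html = file_html(p, CDN, "Bokeh sketch")
```

In an interactive session, calling show(p) from bokeh.plotting opens the plot in the browser, where it can be panned and zoomed.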
How about Altair?
Altair is user-friendly, yet very powerful. Below I again use the iris data set.
Internally, Altair prepares a JSON specification (in the Vega-Lite format) that defines the plot. It can save the output as an HTML file with the graph embedded.
    import pandas as pd
    import altair as alt

    data = pd.read_csv("iris.csv")

    chart = alt.Chart(data).mark_bar().encode(
        x='petal_length',
        y='petal_width',
    )
    chart.save('chart.html')
Conclusion
There are many graph tools for Python. They offer a wide variety of plot types and settings. Which one you will like best is mostly a matter of personal preference. Matplotlib remains an excellent starting point for the beginner Python coder.