Dataset sampling
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
custom_style = {"grid.color": "black", "grid.linestyle": ":", "grid.linewidth": 0.3, "axes.edgecolor": "black", "ytick.left": True, "xtick.bottom": True}
sns.set_context("notebook")
sns.set_theme(style="whitegrid", rc=custom_style)
# Read the dataset
df = pd.read_csv("../data/global-data-on-sustainable-energy.csv")
Statistical Relationship
1. Feature Correlation
Correlation coefficients measure the strength of the linear relationship between two variables. In the heatmap below, the colour of each cell indicates the sign and strength of the correlation. A negative correlation (blue) means the two variables move in opposite directions (when one increases, the other decreases). A positive correlation (red) means they move in the same direction (when one increases, the other increases as well).
plt.figure(figsize=(20,20))
corr_matrix = df.iloc[:, 2:21].corr()
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Correlation matrix of the dataset", fontsize=20, fontweight="bold")
plt.savefig("../img/correlation_matrix.png", dpi=200, bbox_inches="tight")
We can clearly see that some features are highly correlated with each other while others are not. For instance, Financial flow to developing countries (US$) is only weakly correlated with all the other features, so it would be hard to predict it from them. The same holds for the gdp_growth, Longitude and Energy intensity level of primary energy features. It could have been interesting to work with these features, but we will not use them in this project.
An interesting thing to note is that the Latitude feature is quite positively correlated with the Access to electricity and Access to clean fuels for cooking features. This means that the further north you go, the better the access to electricity and clean fuels for cooking. This is not surprising, since northern countries are on average more developed than southern ones (see also the gdp_per_capita correlation score). The same holds for the Primary energy consumption per capita feature: northern countries consume more energy than southern ones, and so tend to have a higher impact on climate change.
Now, we can set an objective for the project. We would like to predict:
- The CO2 emissions
- Primary energy consumption per capita
These are mainly regression problems.
Predicting the share of renewable energy in primary energy consumption could also have been interesting, but that column is so sparsely filled that we can't use it as a response feature (cf. the dataset description section).
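The sparsity argument can be checked directly by computing the fraction of missing values per column. A minimal sketch on a toy frame (the column names here are stand-ins for the real dataset's columns):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the real dataset; the renewables share
# column is sparsely filled, which is why it is rejected as a response.
toy = pd.DataFrame({
    "Renewables share": [12.0, np.nan, np.nan, 8.5],
    "Access to electricity (% of population)": [95.0, 96.0, 97.0, 98.0],
})

# Fraction of missing values per column
missing_fraction = toy.isna().mean()
print(missing_fraction)
```

On the real dataset the same `df.isna().mean()` call gives the per-column missingness at a glance.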
2. Feature selection
Selection for the CO2 emissions prediction
Looking at the heatmap, we can clearly see that a linear relationship exists between some factors and the response (Value_co2_emissions_kt_by_country). Keeping only the features whose absolute correlation score with the response exceeds a threshold of 0.5 gives us a relevant set of inputs for the regression.
co2_cols_to_keep = [column for column in corr_matrix.columns if abs(corr_matrix.loc['Value_co2_emissions_kt_by_country', column]) > 0.5]
co2_cols_to_keep
['Electricity from fossil fuels (TWh)',
'Electricity from nuclear (TWh)',
'Electricity from renewables (TWh)',
'Value_co2_emissions_kt_by_country',
'Land Area(Km2)']
df_co2 = df[['Entity', 'Year'] + co2_cols_to_keep]
df_co2.to_csv("../data/co2/co2_dataset.csv", index=False)
df_co2
Entity | Year | Electricity from fossil fuels (TWh) | Electricity from nuclear (TWh) | Electricity from renewables (TWh) | Value_co2_emissions_kt_by_country | Land Area(Km2) | |
---|---|---|---|---|---|---|---|
0 | Afghanistan | 2000 | 0.16 | 0.0 | 0.31 | 760.000000 | 652230.0 |
1 | Afghanistan | 2001 | 0.09 | 0.0 | 0.50 | 730.000000 | 652230.0 |
2 | Afghanistan | 2002 | 0.13 | 0.0 | 0.56 | 1029.999971 | 652230.0 |
3 | Afghanistan | 2003 | 0.31 | 0.0 | 0.63 | 1220.000029 | 652230.0 |
4 | Afghanistan | 2004 | 0.33 | 0.0 | 0.56 | 1029.999971 | 652230.0 |
... | ... | ... | ... | ... | ... | ... | ... |
3644 | Zimbabwe | 2016 | 3.50 | 0.0 | 3.32 | 11020.000460 | 390757.0 |
3645 | Zimbabwe | 2017 | 3.05 | 0.0 | 4.30 | 10340.000150 | 390757.0 |
3646 | Zimbabwe | 2018 | 3.73 | 0.0 | 5.46 | 12380.000110 | 390757.0 |
3647 | Zimbabwe | 2019 | 3.66 | 0.0 | 4.58 | 11760.000230 | 390757.0 |
3648 | Zimbabwe | 2020 | 3.40 | 0.0 | 4.19 | NaN | 390757.0 |
3649 rows × 7 columns
For this dataset, we will have:
Features | Factors | Response |
---|---|---|
Year | ||
Electricity from fossil fuels (TWh) | X | |
Electricity from nuclear (TWh) | X | |
Electricity from renewables (TWh) | X | |
Land Area (Km2) | X | |
Value_co2_emissions_kt_by_country | X |
Remember also that according to the dataset description, the Value_co2_emissions_kt_by_country
has some missing values that we will have to deal with (drop rows or impute values).
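Both options mentioned above (dropping rows or imputing values) are one-liners in pandas. A minimal sketch on a toy series, using linear interpolation as one possible imputation strategy (the notebook itself does not commit to a strategy here):

```python
import pandas as pd
import numpy as np

# Toy series with a gap, standing in for the response column
s = pd.Series([760.0, 730.0, np.nan, 1220.0])

# Option 1: drop rows with missing values
dropped = s.dropna()

# Option 2: impute, e.g. linear interpolation between neighbours
imputed = s.interpolate()

print(len(dropped), imputed.tolist())
```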
Selection for the Primary energy consumption per capita prediction
We apply the same approach as above, with a slightly lower threshold of 0.4 on the absolute correlation score.
pe_cols_to_keep = [column for column in corr_matrix.columns if abs(corr_matrix.loc['Primary energy consumption per capita (kWh/person)', column]) > 0.4]
pe_cols_to_keep
['Access to electricity (% of population)',
'Access to clean fuels for cooking',
'Renewable energy share in the total final energy consumption (%)',
'Primary energy consumption per capita (kWh/person)',
'gdp_per_capita']
df_pe = df[['Entity', 'Year'] + pe_cols_to_keep]
df_pe.to_csv("../data/pe/pe_dataset.csv", index=False)
df_pe
Entity | Year | Access to electricity (% of population) | Access to clean fuels for cooking | Renewable energy share in the total final energy consumption (%) | Primary energy consumption per capita (kWh/person) | gdp_per_capita | |
---|---|---|---|---|---|---|---|
0 | Afghanistan | 2000 | 1.613591 | 6.2 | 44.99 | 302.59482 | NaN |
1 | Afghanistan | 2001 | 4.074574 | 7.2 | 45.60 | 236.89185 | NaN |
2 | Afghanistan | 2002 | 9.409158 | 8.2 | 37.83 | 210.86215 | 179.426579 |
3 | Afghanistan | 2003 | 14.738506 | 9.5 | 36.66 | 229.96822 | 190.683814 |
4 | Afghanistan | 2004 | 20.064968 | 10.9 | 44.24 | 204.23125 | 211.382074 |
... | ... | ... | ... | ... | ... | ... | ... |
3644 | Zimbabwe | 2016 | 42.561730 | 29.8 | 81.90 | 3227.68020 | 1464.588957 |
3645 | Zimbabwe | 2017 | 44.178635 | 29.8 | 82.46 | 3068.01150 | 1235.189032 |
3646 | Zimbabwe | 2018 | 45.572647 | 29.9 | 80.23 | 3441.98580 | 1254.642265 |
3647 | Zimbabwe | 2019 | 46.781475 | 30.1 | 81.50 | 3003.65530 | 1316.740657 |
3648 | Zimbabwe | 2020 | 52.747670 | 30.4 | 81.90 | 2680.13180 | 1214.509820 |
3649 rows × 7 columns
For this dataset, we will have:
Features | Factors | Response |
---|---|---|
Year | ||
Access to electricity (% of population) | X | |
Access to clean fuels for cooking | X | |
Renewable energy share in the total final energy consumption (%) | X | |
Primary energy consumption per capita (kWh/person) | X | |
GDP per capita | X |
3. Higher dimension relationship
CO2 emissions prediction
# This prints the number of rows with at least one missing value
df_co2.shape[0] - df_co2.dropna().shape[0]
548
This dataset does not contain many missing values. If we drop the rows that contain at least one, we still have more than 3000 samples to train our model. With 5 features, this number of samples is sufficient to train a regression model, so there is no need to look for higher-dimensional relationships.
Primary energy consumption prediction
df_pe.shape[0] - df_pe.dropna().shape[0]
622
Same result as for the CO2 emissions dataset: even if we drop all the rows containing missing values, we still have around 3000 samples to train our model. This is sufficient for a regression model with 5 features.
Feature Sampling
To bring all the features onto a comparable scale, we'll min-max normalize our data.
def normalizeDataFrame(df):
    # Min-max scaling: map each column linearly onto [0, 1]
    return (df - df.min()) / (df.max() - df.min())

def saveDataset(df_to_save, original_df, path):
    # Re-attach the identifier columns before writing to disk
    df_to_save.insert(0, "Entity", original_df["Entity"])
    df_to_save.insert(1, "Year", original_df["Year"])
    df_to_save.to_csv(path, index=False)
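A quick sanity check of the min-max scaling: after normalization, every column should span exactly [0, 1]. A self-contained sketch (the helper is redefined here so the snippet runs on its own):

```python
import pandas as pd

def normalizeDataFrame(df):
    # Min-max scaling: map each column linearly onto [0, 1]
    return (df - df.min()) / (df.max() - df.min())

toy = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [10.0, 30.0, 20.0]})
norm = normalizeDataFrame(toy)
print(norm)  # each column now runs from 0 to 1
```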
CO2 emissions prediction
normalized_df_co2 = normalizeDataFrame(df_co2.iloc[:, 2:])
normalized_df_co2_to_save = normalized_df_co2.copy()
saveDataset(normalized_df_co2_to_save, df_co2, "../data/co2/normalized_co2_dataset.csv")
normalized_df_co2
Electricity from fossil fuels (TWh) | Electricity from nuclear (TWh) | Electricity from renewables (TWh) | Value_co2_emissions_kt_by_country | Land Area(Km2) | |
---|---|---|---|---|---|
0 | 0.000031 | 0.0 | 0.000142 | 0.000070 | 0.065321 |
1 | 0.000017 | 0.0 | 0.000229 | 0.000067 | 0.065321 |
2 | 0.000025 | 0.0 | 0.000256 | 0.000095 | 0.065321 |
3 | 0.000060 | 0.0 | 0.000288 | 0.000113 | 0.065321 |
4 | 0.000064 | 0.0 | 0.000256 | 0.000095 | 0.065321 |
... | ... | ... | ... | ... | ... |
3644 | 0.000675 | 0.0 | 0.001519 | 0.001028 | 0.039134 |
3645 | 0.000588 | 0.0 | 0.001968 | 0.000965 | 0.039134 |
3646 | 0.000720 | 0.0 | 0.002499 | 0.001155 | 0.039134 |
3647 | 0.000706 | 0.0 | 0.002096 | 0.001097 | 0.039134 |
3648 | 0.000656 | 0.0 | 0.001918 | NaN | 0.039134 |
3649 rows × 5 columns
sns.pairplot(normalized_df_co2, height=4, kind="scatter", diag_kind="kde", diag_kws={"linewidth": 1.5, "color": "red"})
We can clearly see correlations between features in the scatter plots. The feature distributions, however, are hard to read, because the countries that emit a lot of CO2 are very few compared to those that do not. Taking the logarithm of the dataset could be an option, but many samples have 0 for the Electricity from nuclear (TWh) feature, and since log(0) is not defined we can't use this method directly. Let's try to remove some outliers instead.
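As an aside, one standard workaround for the log(0) problem (not used in this notebook) is `np.log1p`, which computes log(1 + x) and is well defined at 0. A minimal sketch:

```python
import numpy as np

values = np.array([0.0, 0.5, 10.0])  # includes zeros, as for nuclear TWh
log_values = np.log1p(values)        # log(1 + x), defined at x = 0
print(log_values)
```

This keeps zero-valued samples in the dataset while still compressing the heavy right tail.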
top_outliers_co2 = df_co2['Value_co2_emissions_kt_by_country'].quantile(0.95)
normalized_df_co2_2 = normalizeDataFrame(df_co2[df_co2['Value_co2_emissions_kt_by_country'] < top_outliers_co2].iloc[:, 2:])
normalized_df_co2_2_to_save = normalized_df_co2_2.copy()
saveDataset(normalized_df_co2_2_to_save, df_co2[df_co2['Value_co2_emissions_kt_by_country'] < top_outliers_co2], "../data/co2/normalized_co2_dataset_without_outliers.csv")
print(normalized_df_co2_2.shape)
sns.pairplot(normalized_df_co2_2, height=4, kind="scatter", diag_kind="kde", diag_kws={"linewidth": 1.5, "color": "red"})
plt.savefig("../img/pairplot_co2.png", dpi=200, bbox_inches="tight")
(3059, 5)
This is better! We still have enough samples to train our model, and it is a bit easier to see that our data are fairly evenly sampled. The samples are continuous, and the response feature seems to follow a Weibull or log-normal distribution.
There is also sufficient variation across all features to support statistical modeling.
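One simple way to probe the log-normal hypothesis: if the data are log-normal, their logarithm is normal, so the mean and standard deviation of the log-data estimate the distribution's parameters. A sketch on synthetic data (a stand-in for the response column, not the real values):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic log-normal sample with known parameters mu=1.0, sigma=0.5
sample = rng.lognormal(mean=1.0, sigma=0.5, size=5000)

# log of log-normal data is normal: estimate its mean and std
log_sample = np.log(sample)
mu_hat, sigma_hat = log_sample.mean(), log_sample.std()
print(round(mu_hat, 2), round(sigma_hat, 2))
```

Applied to the real response column (after dropping zeros and NaNs), a histogram of the log-values close to a bell curve would support the log-normal reading.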
Primary energy consumption per capita
normalized_df_pe = normalizeDataFrame(df_pe.iloc[:, 2:])
normalized_df_pe_to_save = normalized_df_pe.copy()
saveDataset(normalized_df_pe_to_save, df_pe, "../data/pe/normalized_pe_dataset.csv")
normalized_df_pe
Access to electricity (% of population) | Access to clean fuels for cooking | Renewable energy share in the total final energy consumption (%) | Primary energy consumption per capita (kWh/person) | gdp_per_capita | |
---|---|---|---|---|---|
0 | 0.003659 | 0.062 | 0.468451 | 0.001152 | NaN |
1 | 0.028581 | 0.072 | 0.474802 | 0.000902 | NaN |
2 | 0.082603 | 0.082 | 0.393898 | 0.000803 | 0.000547 |
3 | 0.136573 | 0.095 | 0.381716 | 0.000876 | 0.000638 |
4 | 0.190513 | 0.109 | 0.460641 | 0.000778 | 0.000806 |
... | ... | ... | ... | ... | ... |
3644 | 0.418333 | 0.298 | 0.852770 | 0.012292 | 0.010961 |
3645 | 0.434707 | 0.298 | 0.858601 | 0.011684 | 0.009102 |
3646 | 0.448824 | 0.299 | 0.835381 | 0.013108 | 0.009260 |
3647 | 0.461066 | 0.301 | 0.848605 | 0.011439 | 0.009763 |
3648 | 0.521484 | 0.304 | 0.852770 | 0.010207 | 0.008935 |
3649 rows × 5 columns
sns.pairplot(normalized_df_pe, height=4, kind="scatter", diag_kind="kde", diag_kws={"linewidth": 1.5, "color": "red"})
For this problem, it is easier to see that all the input features are well distributed: our data are more evenly sampled than for the CO2 problem. However, this is harder to see for the response feature. We can try removing some outliers to obtain a better distribution.
top_outliers_pe = df_pe['Primary energy consumption per capita (kWh/person)'].quantile(0.95)
normalized_df_pe_2 = normalizeDataFrame(df_pe[df_pe['Primary energy consumption per capita (kWh/person)'] < top_outliers_pe].iloc[:, 2:])
normalized_df_pe_2_to_save = normalized_df_pe_2.copy()
saveDataset(normalized_df_pe_2_to_save, df_pe[df_pe['Primary energy consumption per capita (kWh/person)'] < top_outliers_pe], "../data/pe/normalized_pe_dataset_without_outliers.csv")
print(normalized_df_pe_2.shape)
sns.pairplot(normalized_df_pe_2, height=4, kind="scatter", diag_kind="kde", diag_kws={"linewidth": 1.5, "color": "red"})
plt.savefig("../img/pairplot_pe.png", dpi=200, bbox_inches="tight")
(3466, 5)
By removing only the top 5% of outliers, the distribution of the continuous response feature already looks much better! Moreover, its shape resembles a log-normal distribution.
There is also more variation across these features than in the CO2 emissions prediction problem, which is good for statistical modeling.