Dataset sampling
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
custom_style = {"grid.color": "black", "grid.linestyle": ":", "grid.linewidth": 0.3, "axes.edgecolor": "black", "ytick.left": True, "xtick.bottom": True}
sns.set_context("notebook")
sns.set_theme(style="whitegrid", rc=custom_style)
# Read the dataset
df = pd.read_csv("../data/global-data-on-sustainable-energy.csv")
Statistical Relationship
1. Feature Correlation
Correlation coefficients measure the strength of the linear relationship between two variables. In the heatmap below, the colour of each cell indicates the sign and strength of the correlation. A negative correlation (blue) means the two variables move in opposite directions (when one increases, the other decreases). A positive correlation (red) means they move in the same direction (when one increases, the other increases as well).
plt.figure(figsize=(20,20))
corr_matrix = df.iloc[:, 2:21].corr()
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Correlation matrix of the dataset", fontsize=20, fontweight="bold")
plt.savefig("../img/correlation_matrix.png", dpi=200, bbox_inches="tight")
We can clearly see that some features are highly correlated with each other while others are not. For instance, Financial flow to developing countries (US$) is only weakly correlated with all the other features, so it would be hard to predict it from them. The same holds for the gdp_growth, Longitude and Energy intensity level of primary energy features. It could have been interesting to work with these features, but we will not use them in this project.
An interesting thing to note is that the Latitude feature is quite positively correlated with the Access to electricity and Access to clean fuels for cooking features. This means that the further north you go, the better the access to electricity and clean fuels for cooking. This is not surprising, since northern countries are on average more developed than southern ones (see also the gdp_per_capita correlation score). The same holds for the Primary energy consumption per capita feature: northern countries consume more energy than southern ones, and so tend to have a higher impact on climate change.
Now, we can set an objective for the project. We would like to predict:
- The CO2 emissions
- Primary energy consumption per capita
These are mainly regression problems.
Predicting the share of renewable energy in primary energy consumption could also have been interesting, but that column is so sparsely filled that we can't use it as a response feature (cf. the dataset description section).
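The sparsity argument can be checked directly by computing the fraction of missing values per column. A minimal sketch on a toy frame (the column names here are stand-ins for the real dataset's columns):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the real dataset; the renewables share
# column is sparsely filled, which is why it is rejected as a response.
toy = pd.DataFrame({
    "Renewables share": [12.0, np.nan, np.nan, 8.5],
    "Access to electricity (% of population)": [95.0, 96.0, 97.0, 98.0],
})

# Fraction of missing values per column
missing_fraction = toy.isna().mean()
print(missing_fraction)
```

On the real dataset the same `df.isna().mean()` call gives the per-column missingness at a glance.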
2. Feature selection
Selection for the CO2 emissions prediction
Looking at the heatmap, we can clearly see that a linear relationship exists between some factors and the response (Value_co2_emissions_kt_by_country). Keeping only the features whose absolute correlation score with the response exceeds a threshold of 0.5 gives us a relevant set of inputs for the regression.
co2_cols_to_keep = [column for column in corr_matrix.columns if abs(corr_matrix.loc['Value_co2_emissions_kt_by_country', column]) > 0.5]
co2_cols_to_keep
['Electricity from fossil fuels (TWh)',
'Electricity from nuclear (TWh)',
'Electricity from renewables (TWh)',
'Value_co2_emissions_kt_by_country',
'Land Area(Km2)']
df_co2 = df[['Entity', 'Year'] + co2_cols_to_keep]
df_co2.to_csv("../data/co2/co2_dataset.csv", index=False)
df_co2
Entity | Year | Electricity from fossil fuels (TWh) | Electricity from nuclear (TWh) | Electricity from renewables (TWh) | Value_co2_emissions_kt_by_country | Land Area(Km2) | |
---|---|---|---|---|---|---|---|
0 | Afghanistan | 2000 | 0.16 | 0.0 | 0.31 | 760.000000 | 652230.0 |
1 | Afghanistan | 2001 | 0.09 | 0.0 | 0.50 | 730.000000 | 652230.0 |
2 | Afghanistan | 2002 | 0.13 | 0.0 | 0.56 | 1029.999971 | 652230.0 |
3 | Afghanistan | 2003 | 0.31 | 0.0 | 0.63 | 1220.000029 | 652230.0 |
4 | Afghanistan | 2004 | 0.33 | 0.0 | 0.56 | 1029.999971 | 652230.0 |
... | ... | ... | ... | ... | ... | ... | ... |
3644 | Zimbabwe | 2016 | 3.50 | 0.0 | 3.32 | 11020.000460 | 390757.0 |
3645 | Zimbabwe | 2017 | 3.05 | 0.0 | 4.30 | 10340.000150 | 390757.0 |
3646 | Zimbabwe | 2018 | 3.73 | 0.0 | 5.46 | 12380.000110 | 390757.0 |
3647 | Zimbabwe | 2019 | 3.66 | 0.0 | 4.58 | 11760.000230 | 390757.0 |
3648 | Zimbabwe | 2020 | 3.40 | 0.0 | 4.19 | NaN | 390757.0 |
3649 rows × 7 columns
For this dataset, we will have:
Features | Factors | Response |
---|---|---|
Year | ||
Electricity from fossil fuels (TWh) | X | |
Electricity from nuclear (TWh) | X | |
Electricity from renewables (TWh) | X | |
Land Area (Km2) | X | |
Value_co2_emissions_kt_by_country | X |
Remember also that according to the dataset description, the Value_co2_emissions_kt_by_country
has some missing values that we will have to deal with (drop rows or impute values).
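Both options mentioned above (dropping rows or imputing values) are one-liners in pandas. A minimal sketch on a toy series, using linear interpolation as one possible imputation strategy (the notebook itself does not commit to a strategy here):

```python
import pandas as pd
import numpy as np

# Toy series with a gap, standing in for the response column
s = pd.Series([760.0, 730.0, np.nan, 1220.0])

# Option 1: drop rows with missing values
dropped = s.dropna()

# Option 2: impute, e.g. linear interpolation between neighbours
imputed = s.interpolate()

print(len(dropped), imputed.tolist())
```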
Selection for the Primary energy consumption per capita prediction
We apply the same approach as above, with a slightly lower threshold of 0.4 on the absolute correlation score.
pe_cols_to_keep = [column for column in corr_matrix.columns if abs(corr_matrix.loc['Primary energy consumption per capita (kWh/person)', column]) > 0.4]
pe_cols_to_keep
['Access to electricity (% of population)',
'Access to clean fuels for cooking',
'Renewable energy share in the total final energy consumption (%)',
'Primary energy consumption per capita (kWh/person)',
'gdp_per_capita']
df_pe = df[['Entity', 'Year'] + pe_cols_to_keep]
df_pe.to_csv("../data/pe/pe_dataset.csv", index=False)
df_pe
Entity | Year | Access to electricity (% of population) | Access to clean fuels for cooking | Renewable energy share in the total final energy consumption (%) | Primary energy consumption per capita (kWh/person) | gdp_per_capita | |
---|---|---|---|---|---|---|---|
0 | Afghanistan | 2000 | 1.613591 | 6.2 | 44.99 | 302.59482 | NaN |
1 | Afghanistan | 2001 | 4.074574 | 7.2 | 45.60 | 236.89185 | NaN |
2 | Afghanistan | 2002 | 9.409158 | 8.2 | 37.83 | 210.86215 | 179.426579 |
3 | Afghanistan | 2003 | 14.738506 | 9.5 | 36.66 | 229.96822 | 190.683814 |
4 | Afghanistan | 2004 | 20.064968 | 10.9 | 44.24 | 204.23125 | 211.382074 |
... | ... | ... | ... | ... | ... | ... | ... |
3644 | Zimbabwe | 2016 | 42.561730 | 29.8 | 81.90 | 3227.68020 | 1464.588957 |
3645 | Zimbabwe | 2017 | 44.178635 | 29.8 | 82.46 | 3068.01150 | 1235.189032 |
3646 | Zimbabwe | 2018 | 45.572647 | 29.9 | 80.23 | 3441.98580 | 1254.642265 |
3647 | Zimbabwe | 2019 | 46.781475 | 30.1 | 81.50 | 3003.65530 | 1316.740657 |
3648 | Zimbabwe | 2020 | 52.747670 | 30.4 | 81.90 | 2680.13180 | 1214.509820 |
3649 rows × 7 columns
For this dataset, we will have:
Features | Factors | Response |
---|---|---|
Year | ||
Access to electricity (% of population) | X | |
Access to clean fuels for cooking | X | |
Renewable energy share in the total final energy consumption (%) | X | |
Primary energy consumption per capita (kWh/person) | X | |
GDP per capita | X |
3. Higher dimension relationship
CO2 emissions prediction
# This prints the number of rows with at least one missing value
df_co2.shape[0] - df_co2.dropna().shape[0]
548
This dataset does not contain many missing values. If we drop the rows that contain at least one, we still have more than 3000 samples to train our model. With 5 features, this number of samples is sufficient to train a regression model, so there is no need to look for higher-dimensional relationships.
Primary energy consumption prediction
df_pe.shape[0] - df_pe.dropna().shape[0]
622
Same result as for the CO2 emissions dataset: even if we drop all the rows containing missing values, we still have around 3000 samples to train our model. This is sufficient for a regression model with 5 features.
Feature Sampling
To bring all the features onto a comparable scale, we'll min-max normalize our data.
def normalizeDataFrame(df):
    # Min-max scaling: map each column linearly onto [0, 1]
    return (df - df.min()) / (df.max() - df.min())

def saveDataset(df_to_save, original_df, path):
    # Re-attach the identifier columns before writing to disk
    df_to_save.insert(0, "Entity", original_df["Entity"])
    df_to_save.insert(1, "Year", original_df["Year"])
    df_to_save.to_csv(path, index=False)
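A quick sanity check of the min-max scaling: after normalization, every column should span exactly [0, 1]. A self-contained sketch (the helper is redefined here so the snippet runs on its own):

```python
import pandas as pd

def normalizeDataFrame(df):
    # Min-max scaling: map each column linearly onto [0, 1]
    return (df - df.min()) / (df.max() - df.min())

toy = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [10.0, 30.0, 20.0]})
norm = normalizeDataFrame(toy)
print(norm)  # each column now runs from 0 to 1
```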
CO2 emissions prediction
normalized_df_co2 = normalizeDataFrame(df_co2.iloc[:, 2:])
normalized_df_co2_to_save = normalized_df_co2.copy()
saveDataset(normalized_df_co2_to_save, df_co2, "../data/co2/normalized_co2_dataset.csv")
normalized_df_co2
Electricity from fossil fuels (TWh) | Electricity from nuclear (TWh) | Electricity from renewables (TWh) | Value_co2_emissions_kt_by_country | Land Area(Km2) | |
---|---|---|---|---|---|
0 | 0.000031 | 0.0 | 0.000142 | 0.000070 | 0.065321 |
1 | 0.000017 | 0.0 | 0.000229 | 0.000067 | 0.065321 |
2 | 0.000025 | 0.0 | 0.000256 | 0.000095 | 0.065321 |
3 | 0.000060 | 0.0 | 0.000288 | 0.000113 | 0.065321 |
4 | 0.000064 | 0.0 | 0.000256 | 0.000095 | 0.065321 |
... | ... | ... | ... | ... | ... |
3644 | 0.000675 | 0.0 | 0.001519 | 0.001028 | 0.039134 |
3645 | 0.000588 | 0.0 | 0.001968 | 0.000965 | 0.039134 |
3646 | 0.000720 | 0.0 | 0.002499 | 0.001155 | 0.039134 |
3647 | 0.000706 | 0.0 | 0.002096 | 0.001097 | 0.039134 |
3648 | 0.000656 | 0.0 | 0.001918 | NaN | 0.039134 |
3649 rows × 5 columns
sns.pairplot(normalized_df_co2, height=4, kind="scatter", diag_kind="kde", diag_kws={"linewidth": 1.5, "color": "red"})
We can clearly see correlations between features in the scatter plots. The feature distributions, however, are hard to read, because the countries that emit a lot of CO2 are very few compared to those that do not. Taking the logarithm of the dataset could be an option, but many samples have 0 for the Electricity from nuclear (TWh) feature, and since log(0) is not defined we can't use this method directly. Let's try to remove some outliers instead.
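As an aside, one standard workaround for the log(0) problem (not used in this notebook) is `np.log1p`, which computes log(1 + x) and is well defined at 0. A minimal sketch:

```python
import numpy as np

values = np.array([0.0, 0.5, 10.0])  # includes zeros, as for nuclear TWh
log_values = np.log1p(values)        # log(1 + x), defined at x = 0
print(log_values)
```

This keeps zero-valued samples in the dataset while still compressing the heavy right tail.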
top_outliers_co2 = df_co2['Value_co2_emissions_kt_by_country'].quantile(0.95)
normalized_df_co2_2 = normalizeDataFrame(df_co2[df_co2['Value_co2_emissions_kt_by_country'] < top_outliers_co2].iloc[:, 2:])
normalized_df_co2_2_to_save = normalized_df_co2_2.copy()
saveDataset(normalized_df_co2_2_to_save, df_co2[df_co2['Value_co2_emissions_kt_by_country'] < top_outliers_co2], "../data/co2/normalized_co2_dataset_without_outliers.csv")
print(normalized_df_co2_2.shape)
sns.pairplot(normalized_df_co2_2, height=4, kind="scatter", diag_kind="kde", diag_kws={"linewidth": 1.5, "color": "red"})
plt.savefig("../img/pairplot_co2.png", dpi=200, bbox_inches="tight")
(3059, 5)
This is better! We still have enough samples to train our model, and it is a bit easier to see that our data are fairly evenly sampled. The samples are continuous, and the response feature seems to follow a Weibull or log-normal distribution.
There is also sufficient variation across all features to support statistical modeling.
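One simple way to probe the log-normal hypothesis: if the data are log-normal, their logarithm is normal, so the mean and standard deviation of the log-data estimate the distribution's parameters. A sketch on synthetic data (a stand-in for the response column, not the real values):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic log-normal sample with known parameters mu=1.0, sigma=0.5
sample = rng.lognormal(mean=1.0, sigma=0.5, size=5000)

# log of log-normal data is normal: estimate its mean and std
log_sample = np.log(sample)
mu_hat, sigma_hat = log_sample.mean(), log_sample.std()
print(round(mu_hat, 2), round(sigma_hat, 2))
```

Applied to the real response column (after dropping zeros and NaNs), a histogram of the log-values close to a bell curve would support the log-normal reading.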
Primary energy consumption per capita
normalized_df_pe = normalizeDataFrame(df_pe.iloc[:, 2:])
normalized_df_pe_to_save = normalized_df_pe.copy()
saveDataset(normalized_df_pe_to_save, df_pe, "../data/pe/normalized_pe_dataset.csv")
normalized_df_pe
Access to electricity (% of population) | Access to clean fuels for cooking | Renewable energy share in the total final energy consumption (%) | Primary energy consumption per capita (kWh/person) | gdp_per_capita | |
---|---|---|---|---|---|
0 | 0.003659 | 0.062 | 0.468451 | 0.001152 | NaN |
1 | 0.028581 | 0.072 | 0.474802 | 0.000902 | NaN |
2 | 0.082603 | 0.082 | 0.393898 | 0.000803 | 0.000547 |
3 | 0.136573 | 0.095 | 0.381716 | 0.000876 | 0.000638 |
4 | 0.190513 | 0.109 | 0.460641 | 0.000778 | 0.000806 |
... | ... | ... | ... | ... | ... |
3644 | 0.418333 | 0.298 | 0.852770 | 0.012292 | 0.010961 |
3645 | 0.434707 | 0.298 | 0.858601 | 0.011684 | 0.009102 |
3646 | 0.448824 | 0.299 | 0.835381 | 0.013108 | 0.009260 |
3647 | 0.461066 | 0.301 | 0.848605 | 0.011439 | 0.009763 |
3648 | 0.521484 | 0.304 | 0.852770 | 0.010207 | 0.008935 |
3649 rows × 5 columns
sns.pairplot(normalized_df_pe, height=4, kind="scatter", diag_kind="kde", diag_kws={"linewidth": 1.5, "color": "red"})
For this problem, it is easier to see that all the input features are well distributed: our data are more evenly sampled than for the CO2 problem. However, this is harder to see for the response feature. We can try removing some outliers to obtain a better distribution.
top_outliers_pe = df_pe['Primary energy consumption per capita (kWh/person)'].quantile(0.95)
normalized_df_pe_2 = normalizeDataFrame(df_pe[df_pe['Primary energy consumption per capita (kWh/person)'] < top_outliers_pe].iloc[:, 2:])
normalized_df_pe_2_to_save = normalized_df_pe_2.copy()
saveDataset(normalized_df_pe_2_to_save, df_pe[df_pe['Primary energy consumption per capita (kWh/person)'] < top_outliers_pe], "../data/pe/normalized_pe_dataset_without_outliers.csv")
print(normalized_df_pe_2.shape)
sns.pairplot(normalized_df_pe_2, height=4, kind="scatter", diag_kind="kde", diag_kws={"linewidth": 1.5, "color": "red"})
plt.savefig("../img/pairplot_pe.png", dpi=200, bbox_inches="tight")
(3466, 5)
By removing only the top 5% of outliers, the distribution of the continuous response feature already looks much better! Moreover, its shape resembles a log-normal distribution.
There is also more variation across these features than in the CO2 emissions prediction problem, which is good for statistical modeling.