just random datacamp competition using advanced data visualization to extract electricity price in Australia
- Understanding the local electricity market
- 💾 The data
- 💪 Competition challenge
- ⌛️ Exploratory Data Analysis!
- Insights 📍 : The distribution of daily electricity price
- Insights 📍 : Daily electricity demand in Victoria from January 2015 to October 2020
- Make a correlation plot
- Insights 📍 : Correlation Matrix
- Insights 📍 : Electricity price change
- Insights 📍 : Electricity prices change for 2015-2020 on holiday Season
- Insights 📍 : Electricity price change for 2015-2020 on School Season
- Splitting the data based on time
- Insights 📍 : Prediction error price between real price and prediction price
- Insights 📍 : Notice anything about the over forecasted days?
- Revenue generated per year using retail prices and a 70MWh storage system by the energy storage venture
Understanding the local electricity market
📖 Background You work for an energy company in Australia. Your company builds solar panel arrays and then sells the energy they produce to industrial customers. The company wants to expand to the city of Melbourne in the state of Victoria.
Prices and demand for electricity change every day. Customers pay for the energy received using a formula based on the local energy market's daily price.
Your company's pricing committee wants your team to estimate energy prices for the next 12-18 months to use those prices as the basis for contract negotiations.
In addition, the VP of strategy is researching investing in storage capacity (i.e., batteries) as a new source of revenue. The plan is to store some of the energy produced by the solar panels when pricing conditions are unfavorable and sell it by the next day on the open market if the prices are higher.
💾 The data
You have access to over five years of energy price and demand data (source):
- "date" - from January 1, 2015, to October 6, 2020.
- "demand" - daily electricity demand in MWh.
- "price" - recommended retail price in AUD/MWh.
- "demand_pos_price" - total daily demand at a positive price in MWh.
- "price_positive" - average positive price, weighted by the corresponding intraday demand in AUD/MWh.
- "demand_neg_price" - total daily demand at a negative price in MWh.
- "price_negative" - average negative price, weighted by the corresponding intraday demand in AUD/MWh.
- "frac_neg_price" - the fraction of the day when the demand traded at a negative price.
- "min_temperature" - minimum temperature during the day in Celsius.
- "max_temperature" - maximum temperature during the day in Celsius.
- "solar_exposure" - total daily sunlight energy in MJ/m^2.
- "rainfall" - daily rainfall in mm.
- "school_day" - "Y" if that day was a school day, "N" otherwise.
- "holiday" - "Y" if the day was a state or national holiday, "N" otherwise.
Note: The price was negative during some intraday intervals, so energy producers were paying buyers rather than vice-versa.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV
# --- Set defaults of the notebook ---
sns.set(font="'Source Code Pro', monospace")
plt.rcParams["font.family"] = "'Source Code Pro', monospace"
import warnings
warnings.filterwarnings("ignore")
# Color Palettes
treasure_colors = ["#703728", "#c86b25", "#dc9555", "#fed56f", "#c89a37"]
pirate_colors = ["#010307", "#395461", "#449FAF", "#B1F4FC",
"#F4D499", "#835211"]
df_electricity = pd.read_csv('./data/energy_demand.csv', parse_dates=['date'],index_col =["date"])
df_electricity.head()
💪 Competition challenge
Create a report that covers the following:
- How do energy prices change throughout the year? Are there any patterns by season or month of the year?
- Build a forecast of daily energy prices the company can use as the basis of its financial planning.
- Provide guidance on how much revenue the energy storage venture could generate per year using retail prices and a 70MWh storage system.
color_pal = ["#F8766D", "#D39200", "#93AA00", "#00BA38", "#00C19F", "#00B9E3", "#619CFF", "#DB72FB"]
_ = df_electricity["price"].plot(style='.', figsize=(15,5), color=color_pal[0], title='Daily electricity price in Victoria from January 2015 to October 2020')
plt.ylim(0,300)
plt.ylabel("Price")
(df_electricity.describe()
.style
.highlight_max(axis=0,color="#c07fef")
.highlight_min(axis=0,color="#00FF00")
.set_caption("Statistics of Electricity in Australia for 2015-2020"))
fig, ax = plt.subplots(4, 1, figsize = (15, 20))
ax[0].tick_params(axis='both', which='major', labelsize=13)
ax[1].tick_params(axis='both', which='major', labelsize=13)
ax[2].tick_params(axis='both', which='major', labelsize=13)
ax[3].tick_params(axis='both', which='major', labelsize=13)
df_electricity.reset_index().plot(x='date', y='demand_pos_price', lw = 1, ax = ax[0])
ax[0].set_title("Daily electricity demand in Victoria from January 2015 to October 2020")
ax[0].set_ylabel("Demand Positive Price [MWh]")
ax[0].get_legend().remove()
df_electricity.reset_index().plot(x='date', y='demand_neg_price', lw = 1, color='red', ax = ax[1])
ax[1].set_ylabel("Demand Negative Price [MWh]")
ax[1].get_legend().remove()
df_electricity.reset_index().plot(x='date', y='price_positive', lw = 1, color='red', ax = ax[2])
ax[2].set_ylabel("Price Positive [AUD$/MWh]")
ax[2].get_legend().remove()
df_electricity.reset_index().plot(x='date', y='price_negative', lw = 1, color='red', ax = ax[3])
ax[3].set_ylabel("Price Negative [AUD$/MWh]")
ax[3].get_legend().remove()
Insights 📍 : Daily electricity demand in Victoria from January 2015 to October 2020
Based on the plot above show:
- Demand positive price shows some seasonality based on plot shown above</span>. Maybe this feature may benefit our model.
- Demand negative price We can see there is a significant difference electricity price through maximum price which is very far from the mean and standard deviation.
- Positive price mostly with small values except for some values that is greater than 1000.
- Price negative ocurred in each year with variation of values.
def heatmap(x, y, size):
fig, ax = plt.subplots()
# Mapping from column names to integer coordinates
x_labels = [v for v in sorted(x.unique())]
y_labels = [v for v in sorted(y.unique())]
x_to_num = {p[1]:p[0] for p in enumerate(x_labels)}
y_to_num = {p[1]:p[0] for p in enumerate(y_labels)}
size_scale = 500
ax.scatter(
x=x.map(x_to_num), # Use mapping for x
y=y.map(y_to_num), # Use mapping for y
s=size * size_scale, # Vector of square sizes, proportional to size parameter
marker='s' # Use square as scatterplot marker
)
# Show column labels on the axes
ax.set_xticks([x_to_num[v] for v in x_labels])
ax.set_xticklabels(x_labels, rotation=45, horizontalalignment='right')
ax.set_yticks([y_to_num[v] for v in y_labels])
ax.set_yticklabels(y_labels)
data = df_electricity
columns = df_electricity.columns
corr = data[columns].corr()
corr = pd.melt(corr.reset_index(), id_vars='index') # Unpivot the dataframe, so we can get pair of arrays for x and y
corr.columns = ['x', 'y', 'value']
heatmap(
x=corr['x'],
y=corr['y'],
size=corr['value'].abs()
)
plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(df_electricity.groupby([df_electricity.index.year])):
ax = plt.subplot(2, 3, i+1, ymargin=0.5)
#print(df.num_sold.values.shape, df.num_sold.values)
ax.plot(df.price)
ax.set_title(combi)
plt.xticks(rotation=45)
plt.tight_layout(h_pad=3.0)
plt.suptitle('Energy prices change for 2015-2020', y=1.03)
plt.show()
Insights 📍 : Electricity price change
Based on the plot above show:
- There is a significant increase of electricity price in December 2015 after a stable price during January-September.> 2. The electricity price tends to be stable in range 50-150 AUD except in February where there is an increase of energy price.> 2. For 2016, significant increase of price occured in January and March followed by stable price during April-December> 3. For 2018-2019, There is a significant increase of electricity price around January-February.
plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(df_electricity.groupby([df_electricity.index.year,"holiday"])):
ax = plt.subplot(3,4, i+1, ymargin=0.5)
#print(df.num_sold.values.shape, df.num_sold.values)
ax.plot(df.price)
ax.set_title(combi)
plt.xticks(rotation=45)
plt.suptitle('Energy prices change for 2015-2020 on holiday Season', y=1.03)
plt.figure(figsize=(18, 12))
for i, (combi, df) in enumerate(df_electricity.groupby([df_electricity.index.year,"school_day"])):
ax = plt.subplot(3,4, i+1, ymargin=0.5)
#print(df.num_sold.values.shape, df.num_sold.values)
ax.plot(df.price)
ax.set_title(combi)
plt.xticks(rotation=40)
plt.suptitle('Energy prices change for 2015-2020 on School Season', y=1.03)
Insights 📍 : Electricity price change for 2015-2020 on School Season
Based on the plot above show:
- There is a significant increase of electricity price in the end of 2015 on school season> 2. There is a similar pattern of the increase of electricity price on school season in February-March in 2018, 2019, 2020> 2. For annual non-holiday Season, There is an increase of electricity price in the beginning of the year in 2016, 2018, 2019
split_date = '01-Jan-2019'
df_train = df_electricity.loc[df_electricity.index <= split_date].copy()
df_test = df_electricity.loc[df_electricity.index > split_date].copy()
def create_features(df, label=None):
"""
Creates time series features from datetime index
"""
df['date'] = df.index
df['hour'] = df['date'].dt.hour
df['dayofweek'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['dayofyear'] = df['date'].dt.dayofyear
df['dayofmonth'] = df['date'].dt.day
df['weekofyear'] = df['date'].dt.isocalendar().week.astype(np.int64)
df["school_day"] = pd.get_dummies(df["school_day"])
df["school_day"] = pd.get_dummies(df["school_day"])
df['price_7_days_lag'] = df['price'].shift(7)
df['price_15_days_lag'] = df['price'].shift(15)
df['price_30_days_lag'] = df['price'].shift(30)
df['price_7_days_mean'] = df['price'].rolling(window = 7).mean()
df['price_15_days_mean'] = df['price'].rolling(window = 15).mean()
df['price_30_days_mean'] = df['price'].rolling(window = 30).mean()
df['price_7_days_std'] = df['price'].rolling(window = 7).std()
df['price_15_days_std'] = df['price'].rolling(window = 15).std()
df['price_30_days_std'] = df['price'].rolling(window = 30).std()
df['price_7_days_max'] = df['price'].rolling(window = 7).max()
df['price_15_days_max'] = df['price'].rolling(window = 15).max()
df['price_30_days_max'] = df['price'].rolling(window = 30).max()
df['price_7_days_min'] = df['price'].rolling(window = 7).min()
df['price_15_days_min'] = df['price'].rolling(window = 15).min()
df['price_30_days_min'] = df['price'].rolling(window = 30).min()
cols = ['hour','dayofweek','dayofyear','quarter','month','year','dayofmonth','weekofyear','price_positive','demand_pos_price','demand','demand_neg_price','price_negative','frac_neg_price']
#cols = ['hour','dayofweek','quarter','month','year','dayofyear','dayofmonth','weekofyear','demand_pos_price',]
for d in ('7', '15', '30'):
for c in ('lag', 'mean', 'std', 'max', 'min'):
cols.append(f'price_{d}_days_{c}')
X = df[cols]
if label:
y = df[label]
return X, y
return X
X_train, y_train = create_features(df_train, label='price')
X_test, y_test = create_features(df_test, label='price')
reg = xgb.XGBRegressor(n_estimators=100,eta=0.1)
fitted_xgb_model = reg.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
early_stopping_rounds=50,
verbose=True)
plt.plot(fitted_xgb_model.evals_result()['validation_0']
['rmse'])
plt.plot(fitted_xgb_model.evals_result()['validation_1']
['rmse'])
plt.ylabel('RMSE', fontsize=14)
plt.xlabel('Price', fontsize=14)
plt.legend(['Train', 'Val'], loc='upper right')
We can see the comparison between training rmse and validation rmse score where the is an underfitting problem in our case where maybe due to outlier electricity price in 2019-2020
feature_important = reg.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by = "score", ascending=False)
data.nlargest(10, columns="score").plot(kind='barh', figsize = (20,10)) ## plot top 10 features
Based on the feature importances extracted from the model, price_positive feature, demand_neg_price, dayofweek,dayofyear and dayofyear show the most 5th of feature importances.
df_test['price_prediction'] = reg.predict(X_test)
price_all = pd.concat([df_test, df_train], sort=False)
_ = price_all[['price','price_prediction']].plot(figsize=(15, 5))
plt.ylim(0,1000)
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
_ = price_all[['price','price_prediction']].plot(ax=ax,style=['-','.'])
ax.set_xbound(lower='01-05-2019', upper='02-10-2020')
ax.set_ylim(0, 2000)
plot = plt.suptitle('May 2019 Forecast vs Actuals')
Insights 📍 : Prediction error price between real price and prediction price
Our model is generalizable for the next 12-18 months based on the plot above. In Machine Learninig, It is a mandatory to make our model to be generalizable as possible so that it can predict the real/actual value in real situation. There is a few errors around february and in the beginining of January 2020. If it can predict the same value for actual demand, it is a overfitting problem. We have to take into account this case to make better prediction that can mimic the actual value with fewer errors.
mean_squared_error(y_true=df_test['price'],y_pred=df_test['price_prediction'])
mean_absolute_error(y_true=df_test['price'],y_pred=df_test['price_prediction'])
df_test['error'] = df_test['price'] - df_test['price_prediction']
df_test['abs_error'] = df_test['error'].apply(np.abs)
error_by_day = df_test.groupby(['year','month','dayofmonth']).mean()[['price','price_prediction','error','abs_error']]
(error_by_day.sort_values('error', ascending=True).head(10)
.style
.bar(subset="error",color="#c07fef",vmax=error_by_day.error.quantile(0.95))
.highlight_max(axis=0,color="#fef70c")
)
(error_by_day.sort_values('abs_error', ascending=False).head(10)
.style
.bar(subset="abs_error",color="#c07fef",vmax=error_by_day.abs_error.quantile(0.95))
.highlight_max(axis=0,color="#fef70c")
)
(error_by_day.sort_values('abs_error', ascending=True).head(10)
.style
.bar(subset="abs_error",color="#c07fef",vmax=error_by_day.abs_error.quantile(0.95))
.highlight_max(axis=0,color="#fef70c")
)
predicton_vs_Actual = df_test.rename(columns={"price":"PRICE","price_prediction":"PRICE_PREDICTION"})
predicton_vs_Actual.head()
(predicton_vs_Actual.sample(5)
.style
.background_gradient(axis=0,subset=["PRICE","PRICE_PREDICTION"],cmap="Blues")
.set_caption("Comparison of Real price and Prediction price")
)
(predicton_vs_Actual.describe()
.style
.highlight_max(axis=0,color="#c07fef")
.highlight_min(axis=0,color="#00FF00")
.set_caption("Price for the next 12-18 months dataset")
)
train_df_result =df_train.groupby("year")["price","demand_pos_price","demand_neg_price","frac_neg_price","price_negative","demand"].agg(["mean","max","min"]).stack()
(train_df_result
.style
.background_gradient(axis=0,subset=["demand","price"],cmap="Blues")
.set_caption("Comparison of Real Demand and Prediction Demand")
)
test_df_result =df_test.groupby("year")["price","demand_pos_price","demand_neg_price","frac_neg_price","price_negative","demand","price_prediction"].agg(["mean","max","min"]).stack()
(test_df_result
.style
.background_gradient(axis=0,subset=["price","price_prediction"],cmap="Blues")
.set_caption("Comparison of Real Demand and Prediction Demand")
)
test_df_result =df_test.groupby("year")["price","demand_pos_price","demand_neg_price","frac_neg_price","price_negative","demand","price_prediction"].agg(["mean","max","min"]).stack()
(test_df_result
.style
.background_gradient(axis=0,subset=["price","price_prediction"],cmap="Blues")
.set_caption("Comparison of Real Demand and Prediction Demand")
)
df_electricity["total_price"] = df_electricity["demand"]*df_electricity["price"]
df_electricity.groupby(df_electricity.index.year)["price"].agg(["mean","max","min"]).unstack()
demand_negative_price = df_electricity[df_electricity["price_negative"] < 0]
demand_negative_price
demand_positive_price = df_electricity[df_electricity["price_negative"] >= 0]
demand_positive_price
demand_negative= (demand_negative_price.groupby(demand_negative_price.index.year)["total_price"].agg(["sum","max","min"])).reset_index()
demand_negative
This is the condition when the demand for negative price per year. You can see on 2020 where the minimum price show a negative value.
demand_positive = demand_positive_price.groupby(demand_positive_price.index.year)["total_price"].agg(["sum","max","min"]).reset_index()
demand_positive
This is the condition when the demand for positive price per year. You can see on per year where the minimum price show a an appreciation from the market that is shown on the positive electiricity price.
result = demand_negative.merge(demand_positive,on="date",suffixes =["_demand_pos","_demand_neg"])
result
result["profit"] =result["sum_demand_pos"]-result["sum_demand_neg"]
result
We then substract the result when the demand is positive and demand is negative based on the total price feature that we have been created. Most of the years shows a good result.
(result.sort_values("profit",ascending=False)
.style
.background_gradient(axis=0,subset=["profit"],cmap="Blues")
.set_caption("Revenue Generated per year in the daily electiricity in Victoria Australia")
)
This is the result of revenue generated per year. We can see a good revenue generated per year. If you find this notebook usefull, Please show your appreciation by upvoting this notebook. Thank you.
References
Here is a few references for creating advanced visualization using matplotlib 1.
1. You can see a lot of data visualization plots here Data Visualization with Python!↩