Exploratory Data Analysis for my expenses using Python.¶

-Aadit Kapoor¶

We use the dataset in the form of csv from the ThomasCook Website to get my expenditure for 1 year.¶

Our aim is to discover:¶

Trends
The way I spend money
Prediction of the amount I may spend on a particular day.

# importing
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,BaggingRegressor,GradientBoostingRegressor
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import seaborn as sns
import datetime

pd.options.mode.chained_assignment = None

Overview of the dataset¶

Let us first start by viewing our data and gathering some straight up facts from it.

# loading data
data = pd.read_csv("expenses.csv", parse_dates=["st_date","t_date"])
data.sort_values(by=["t_date", "st_date"], inplace=True) # Give the sorted data
data_copy = data.copy()
data.head(10)

# resetting index
print ("Length: ", len(data))
print ("Number of columns: ", len(data.columns))
print 
print ("Range of the data is from: ", data.t_date.min(), "to this: ", data.t_date.max())
print ("The maximum amount in the account is: ", "$",data.total_amt.max())
print ("Mode: ", data.total_amt.mode())
print ("Median: ", data.total_amt.median())
print ("Mean: ", data.total_amt.mean())

print ("Minimum amount in the account is: $", data.total_amt.min())
print()
print ("The first transaction to the account was: $", data.t_amount.values[0], "at: ", data["tran_details"][data["t_amount"] == data["t_amount"].values[0]])

Length:  45
Number of columns:  14
Range of the data is from:  2018-08-16 00:00:00 to this:  2019-11-02 00:00:00
The maximum amount in the account is:  $ 1000.0
Mode:  0    0.79
1    7.51
dtype: float64
Median:  12.02
Mean:  57.28288888888888
Minimum amount in the account is: $ 0.79

The first transaction to the account was: $ 100.0 at:  44    MEMO CASH ADVANCE BANK OF AMERICA CA0158 SAN J...
Name: tran_details, dtype: object

- Looking at the data we find that there are 45 rows (entries) and there are 14 columns. The data ranges from the period of August 2018 to August 2019.¶

A brief overview of the features are as follows:¶

st_date: Settlement date
t_date: Transaction date
ref_num: Reference Number
tran_details: Transaction details
t_amount: transaction amount
curr: Currency
s_amt: Settlement amount
curr_2: Currency
fee_amt: Fee amount
fee_gst: Fee GST
cross_charges: Cross currency charges
gst_cross: GST Cross
total_amt: Total amount
D_C: Debit or Credit

Data Preprocessing¶

Looking at the data, we eliminate a certain number of columns and then proceed with our investigation.

# data preprocessing functions

days = {"Monday":0, "Tuesday":1, "Wednesday":2, "Thursday":3, "Friday":4, "Saturday":5, "Sunday":6}

# convert date to day of the week
def convert_date_to_week(date):
    year, month, day = (int(x) for x in date.split('/'))   
    ans = datetime.date(year, month, day)
    return ans.strftime("%A")

data.set_index(data_copy.tran_details, inplace=True)

# removing the highest amount as it was deposited.
data.drop("Card Reload F0566I19SDD22005", inplace=True, axis=0)
data.drop("Card Reload F0566I19SCX88728", inplace=True, axis=0)

# resetting index
data.head()

# Reset the index
data.reset_index(drop=True, inplace=True)

# removing settlement date, reference number, curr, s_amt, curr_2, fee_amt, fee_gst, cross_charges, gst_cross, d_c
data.drop(columns=["st_date", "ref_num","total_amt","curr","s_amt","curr_2", "fee_amt", "fee_gst", "cross_charges", "gst_cross", "D_C"], axis=1,inplace=True)
data.head()

We removed all the unrequired columns as follows:¶

st_date: We do not care about the settlement date
ref_num: Reference number is no use to us
curr: We already know the currency (Everything is in $)
s_amt: Settlement amount
fee_amt: this amount is negligible
D_c: no use to the user

threshold = 10.0 # For a student, spending more than or equal to $10 is considered expensive (can be changed)
data["is_place_duplicated"] = data["tran_details"].duplicated()

Now we aim to set a new column to tackle the tran_details column and make it useful. We aim to apply feature engineering to set all the places where I have spent above the threshold 1 and the other places as 0.

"""
    return all the duplicated places.
"""
def get_duplicated_places(data, key="is_place_duplicated"):
    places = []
    for place, t_f in zip(data["tran_details"],data[key]):
        if t_f:
            places.append(place)
        else:
            pass
    return places

# get the count for each place.
def get_count():
    places = {"place":[], "count":[]}
    for place, count in zip(data.tran_details.value_counts().keys(), data.tran_details.value_counts().values):
        places["place"].append(place)
        places["count"].append(count)
    return places

for place in  (get_duplicated_places(data)):
    print (place)

DEBIT PURCHASE THE CHOCOLATE MARKET SAN FRANCISCO CA
ECOM DEBIT Completion UBR* PENDING.UBER.COM SAN FRANCISCO CA
ECOM DEBIT Completion AMZN Pickup Amzn.com/bill WA
DEBIT PURCHASE 7-ELEVEN 14288 SAN JOSE CA
DEBIT PURCHASE 7-ELEVEN 14288 SAN JOSE CA
ECOM DEBIT Completion AMZN Mktp US Amzn.com/bill WA
DEBIT PURCHASE 7-ELEVEN 14288 SAN JOSE CA
DEBIT PURCHASE 7-ELEVEN 95113 SAN JOSE CA
DEBIT PURCHASE 7-ELEVEN 95113 SAN JOSE CA
DEBIT PURCHASE 7-ELEVEN 14288 SAN JOSE CA
MEMO CASH ADVANCE BANK OF AMERICA CA0158 SAN JOSE CA
DEBIT PURCHASE H20 WIRELESS FORT LEE NJ
DEBIT PURCHASE THE CHOCOLATE MARKET SAN FRANCISCO CA
ECOM DEBIT Completion Lyft *Pending SAN FRANCISCO CA
DEBIT PURCHASE THE MARKET (SAFEWAY) San Jose CA

All the duplicated places where I have spent money.

data.head()

print ("Max duplicated place: ", data.tran_details.value_counts().max(), (get_count()["place"][0]))

Max duplicated place:  5 DEBIT PURCHASE 7-ELEVEN 14288 SAN JOSE CA

pd.DataFrame(get_count()).plot.bar(x="place",y="count")

<matplotlib.axes._subplots.AxesSubplot at 0x1a2d11d9b0>

We can see a distribution of the money spent in these places.

# Total money spent at each place.
# We removed these two keys as these were the money reloaded texts.
data.groupby("tran_details")["t_amount"].sum()

tran_details
DEBIT PURCHASE 7-ELEVEN 14288 SAN JOSE CA                        30.67
DEBIT PURCHASE 7-ELEVEN 95113 SAN JOSE CA                         8.14
DEBIT PURCHASE CHE PANDA EXPR SAN JOSE CA                        13.98
DEBIT PURCHASE CHE PASEO MRKT SAN JOSE CA                         1.93
DEBIT PURCHASE DICK'S SPORTING GOODS MILPITAS CA                 49.03
DEBIT PURCHASE H20 WIRELESS FORT LEE NJ                          30.00
DEBIT PURCHASE JACK IN THE BOX 0409 SAN JOSE CA                   6.87
DEBIT PURCHASE KOI PALACE EXPRESS SAN FRANCISCO CA               14.41
DEBIT PURCHASE MADRAS GROCERIES SUNNYVALE CA                      7.98
DEBIT PURCHASE NAPA FARMS MARKET TG SAN FRANCISCO CA              3.27
DEBIT PURCHASE NEW STAND AREA G SAN FRANCISCO CA                  2.99
DEBIT PURCHASE PAMF 685 COLEMAN MF SAN JOSE CA                   79.00
DEBIT PURCHASE SJSUNIBKSTORE #80110 SAN JOSE CA                 110.72
DEBIT PURCHASE SQU*SQ *JOE & THE JUIC San Francisco CA            7.83
DEBIT PURCHASE SUPERCUTS SAN JOSE CA                             19.95
DEBIT PURCHASE THE CHOCOLATE MARKET SAN FRANCISCO CA             34.63
DEBIT PURCHASE THE MARKET (SAFE) 2934 San Jose CA                 9.26
DEBIT PURCHASE THE MARKET (SAFEWAY) San Jose CA                  24.77
DEBIT PURCHASE WILEY BOOK PUBLISHERS INDIANAPOLIS IN             76.00
ECOM DEBIT Completion AMZN Mktp US Amzn.com/bill WA              32.74
ECOM DEBIT Completion AMZN Pickup Amzn.com/bill WA               28.35
ECOM DEBIT Completion Bigzpoon 4087754040 CA                      7.96
ECOM DEBIT Completion CONNECT4EDUCATION 703-880-1180 VA         109.95
ECOM DEBIT Completion Lyft *Pending SAN FRANCISCO CA             32.91
ECOM DEBIT Completion SP * PETE PEDRO MARIETTA GA                21.72
ECOM DEBIT Completion TCD*CENGAGE LEARNING 800-354-9706 KY      125.00
ECOM DEBIT Completion UBR* PENDING.UBER.COM SAN FRANCISCO CA    152.94
MEMO CASH ADVANCE BANK OF AMERICA CA0158 SAN JOSE CA            300.00
Name: t_amount, dtype: float64

# the max money spent
print ("Max money spent: ", data.groupby("tran_details")["t_amount"].sum().max())
data.groupby("tran_details")["t_amount"].sum()

Max money spent:  300.0

tran_details
DEBIT PURCHASE 7-ELEVEN 14288 SAN JOSE CA                        30.67
DEBIT PURCHASE 7-ELEVEN 95113 SAN JOSE CA                         8.14
DEBIT PURCHASE CHE PANDA EXPR SAN JOSE CA                        13.98
DEBIT PURCHASE CHE PASEO MRKT SAN JOSE CA                         1.93
DEBIT PURCHASE DICK'S SPORTING GOODS MILPITAS CA                 49.03
DEBIT PURCHASE H20 WIRELESS FORT LEE NJ                          30.00
DEBIT PURCHASE JACK IN THE BOX 0409 SAN JOSE CA                   6.87
DEBIT PURCHASE KOI PALACE EXPRESS SAN FRANCISCO CA               14.41
DEBIT PURCHASE MADRAS GROCERIES SUNNYVALE CA                      7.98
DEBIT PURCHASE NAPA FARMS MARKET TG SAN FRANCISCO CA              3.27
DEBIT PURCHASE NEW STAND AREA G SAN FRANCISCO CA                  2.99
DEBIT PURCHASE PAMF 685 COLEMAN MF SAN JOSE CA                   79.00
DEBIT PURCHASE SJSUNIBKSTORE #80110 SAN JOSE CA                 110.72
DEBIT PURCHASE SQU*SQ *JOE & THE JUIC San Francisco CA            7.83
DEBIT PURCHASE SUPERCUTS SAN JOSE CA                             19.95
DEBIT PURCHASE THE CHOCOLATE MARKET SAN FRANCISCO CA             34.63
DEBIT PURCHASE THE MARKET (SAFE) 2934 San Jose CA                 9.26
DEBIT PURCHASE THE MARKET (SAFEWAY) San Jose CA                  24.77
DEBIT PURCHASE WILEY BOOK PUBLISHERS INDIANAPOLIS IN             76.00
ECOM DEBIT Completion AMZN Mktp US Amzn.com/bill WA              32.74
ECOM DEBIT Completion AMZN Pickup Amzn.com/bill WA               28.35
ECOM DEBIT Completion Bigzpoon 4087754040 CA                      7.96
ECOM DEBIT Completion CONNECT4EDUCATION 703-880-1180 VA         109.95
ECOM DEBIT Completion Lyft *Pending SAN FRANCISCO CA             32.91
ECOM DEBIT Completion SP * PETE PEDRO MARIETTA GA                21.72
ECOM DEBIT Completion TCD*CENGAGE LEARNING 800-354-9706 KY      125.00
ECOM DEBIT Completion UBR* PENDING.UBER.COM SAN FRANCISCO CA    152.94
MEMO CASH ADVANCE BANK OF AMERICA CA0158 SAN JOSE CA            300.00
Name: t_amount, dtype: float64

print ("Min money spent: ", data.groupby("tran_details")["t_amount"].sum().min())

Min money spent:  1.93

print ("Mean money spent: ", data.groupby("tran_details")["t_amount"].sum().mean())

Mean money spent:  47.964285714285715

Here we can see the max, min and the mean spent on these places.

# create a new column (perform feature engineering)
def create_new_column():
    
    # replacing values of tran_details where t_amount is >= 10.0 with 1 or else 0 or we can use np.where()
    (data.tran_details[data.t_amount>=10.0]) = 1
    (data.tran_details[data.t_amount<10.0]) = 0

create_new_column() # create the new column

data.head()

As we can see, the tran_details have been changed according to the money I spent in t_amount following the threshold.
So now let us change the dates to days and calculate the most frequent date/day.

data["is_place_duplicated"]=data["is_place_duplicated"].map({False:0, True:1}) # changing the is_place_duplicated

data.head()

Changing the is_place_duplicated column to numerical form.

sns.lineplot(x="t_date",y="t_amount", data=data, hue="is_place_duplicated")

<matplotlib.axes._subplots.AxesSubplot at 0x1a2d003fd0>

A line plot that demonstrates the fluctuation of t_amount w.r.t t_date

sns.scatterplot(x="t_amount",y="t_date", data=data,hue="is_place_duplicated" )

<matplotlib.axes._subplots.AxesSubplot at 0x1a2cd63b38>

a scatter plot that represents on what date I spent the most money on.

sns.countplot(x="tran_details",data=data, hue="tran_details")

<matplotlib.axes._subplots.AxesSubplot at 0x1a2ce27e80>

As you can see, 1 representing the expensive places has a greater frequency of occuring than 0.

Getting individuals stats between expensive and not expensive places.

data.groupby("tran_details")["t_amount"].sum()

tran_details
0     101.55
1    1241.45
Name: t_amount, dtype: float64

As you can see, the expensive area costs around $1241 and I spent a lot on these areas.

data.groupby("tran_details")["t_amount"].mean()

tran_details
0     5.641667
1    49.658000
Name: t_amount, dtype: float64

As you can see, the expensive area mean costs around $49 and the average spending at an expensive area was about 50.

data.groupby("tran_details")["t_amount"].max()

tran_details
0      9.77
1    200.00
Name: t_amount, dtype: float64

As you can see, the max area costs around $200.

data.head()

Date at which I spent the most?¶

data.t_date[data.t_amount == np.max(data.t_amount.values)]

29   2019-05-20
Name: t_date, dtype: datetime64[ns]

The date at which I spent the most was: 2019-05-20.

Date at which I spent the least?¶

data.t_date[data.t_amount == np.min(data.t_amount.values)]

12   2019-02-05
23   2019-04-05
27   2019-04-26
Name: t_date, dtype: datetime64[ns]

There were three dates when I had the minimum amount of money.

Data Prepartion for prediction¶

data.head()

Now let us move with making a predicting model.
To achieve this, we first convert t_date into days and them encode them with a mapping.

"""
    convert()
    convert the date in to day.
"""
def convert(data):
    days = []
    for date in data["t_date"]:
        date = str(date).replace("00:00:00","")
        date = date.replace("-","/")
        days.append(convert_date_to_week(date))
    return days

# convert the column
data["t_date"] = convert(data)

Now we have about 3 features.

Now we have changed the dates to days, we now begin by mapping days to numbers.

# convert into mapping
data["t_date"] = data["t_date"].map(days)

data.head()

data.corr()

We see a correlation between the features.

sns.relplot(x="t_date", y="t_amount",data=data)

<seaborn.axisgrid.FacetGrid at 0x1a2edd5128>

sns.relplot(x="tran_details", y="t_amount",data=data)

<seaborn.axisgrid.FacetGrid at 0x1a2f0844a8>

sns.relplot(x="is_place_duplicated", y="t_amount",data=data)

<seaborn.axisgrid.FacetGrid at 0x1a2f187b00>

sns.pairplot(data,hue="tran_details")

<seaborn.axisgrid.PairGrid at 0x1a2c9878d0>

Prediction¶

features = data.drop("t_amount", axis=1).values
labels = data.t_amount.values

# splitting data
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, shuffle=True, random_state=42)

classifiers = {
    "linear regression": LinearRegression(),
    "svr": SVR(),
    "random forest":RandomForestRegressor(n_estimators=5),
    "Decision tree":DecisionTreeRegressor(),
    "BaggingRegressor":BaggingRegressor(),
    "Gradient Boosting":GradientBoostingRegressor(),
}

def train():
    for i, classifier in enumerate(classifiers):
        print ("Training with: ", i)
        classifiers.get(classifier).fit(features_train, labels_train)
        print("Training score: ", classifiers.get(classifier).score(features_train ,labels_train))
        print("Testing score: ", classifiers.get(classifier).score(features_test ,labels_test))
        print()

train()

Training with:  0
Training score:  0.2376959870500167
Testing score:  0.3981717628376832

Training with:  1
Training score:  -0.17361505860299298
Testing score:  -0.08759456430941825

Training with:  2
Training score:  0.38549543760370497
Testing score:  0.3294553182125629

Training with:  3
Training score:  0.4372731898394506
Testing score:  0.14911766044258212

Training with:  4
Training score:  0.38832419979790167
Testing score:  0.34273972101193007

Training with:  5
Training score:  0.4370434312071776
Testing score:  0.18207413623141863

/Users/aaditkapoor/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)

We get awful results with the these regressors as our data is limited.

Prediction Examples¶

What is the estimated spending on a Saturday at a unique and an expensive place?¶

demo = np.array([6, 1, 0])
print ("Estimated cost on a Saturday is: $",classifiers["random forest"].predict(demo.reshape(1,-1))[0])

Estimated cost on a Saturday is: $ 48.359

6 stands for Saturday
1 stands for expensive place
0 stands for unique place

Conclusion¶

We are able to successfully gather sufficient about my expenses.
We see that my frequency of going to an expensive place (>10) was more than less expensive places.
As you can see, the expensive area mean costs around 49 and the average spending at expensive area was about 50.
As you can see, the expensive area costs around 1241 and I spent a lot on these areas.
Our model's metric were not upto to the mark as there was limitation on the data.
As we keep getting data, we will update our model.

	st_date	t_date	ref_num	tran_details	t_amount	curr	curr_2	fee_amt	fee_gst	total_amt	D_C
44	2018-08-18	2018-08-16	26025468	MEMO CASH ADVANCE BANK OF AMERICA CA0158 SAN J...	100.00	USD	USD	1.75	0.32	102.07	Dr
43	2018-08-19	2018-08-16	26021120	DEBIT PURCHASE CHE PANDA EXPR SAN JOSE CA	13.98	USD	USD	0.00	0.00	13.98	Dr
42	2018-10-16	2018-10-14	31840420	DEBIT PURCHASE PAMF 685 COLEMAN MF SAN JOSE CA	79.00	USD	USD	0.00	0.00	79.00	Dr
40	2018-12-20	2018-12-18	37923858	DEBIT PURCHASE THE CHOCOLATE MARKET SAN FRANCI...	4.88	USD	USD	0.00	0.00	4.88	Dr
41	2018-12-20	2018-12-18	37923361	DEBIT PURCHASE THE CHOCOLATE MARKET SAN FRANCI...	9.77	USD	USD	0.00	0.00	9.77	Dr
38	2018-12-21	2018-12-18	37923660	DEBIT PURCHASE NEW STAND AREA G SAN FRANCISCO CA	2.99	USD	USD	0.00	0.00	2.99	Dr
39	2018-12-21	2018-12-18	37921147	DEBIT PURCHASE KOI PALACE EXPRESS SAN FRANCISC...	14.41	USD	USD	0.00	0.00	14.41	Dr
20	2019-04-04	2019-01-04	46833172	DEBIT PURCHASE 7-ELEVEN 14288 SAN JOSE CA	7.51	USD	USD	0.00	0.00	7.51	Dr
36	2019-01-23	2019-01-21	40358666	ECOM DEBIT Completion UBR* PENDING.UBER.COM SA...	76.47	USD	USD	0.00	0.00	76.47	Dr
37	2019-01-23	2019-01-21	40358393	ECOM DEBIT Completion UBR* PENDING.UBER.COM SA...	76.47	USD	USD	0.00	0.00	5.00	Dr

	t_date	t_amount	is_place_duplicated
t_date	1.000000	-0.248738	-0.020201
t_amount	-0.248738	1.000000	-0.067875
is_place_duplicated	-0.020201	-0.067875	1.000000

	t_date	tran_details	t_amount	is_place_duplicated
0	2018-08-16	1	100.00	False
1	2018-08-16	1	13.98	False
2	2018-10-14	1	79.00	False
3	2018-12-18	0	4.88	False
4	2018-12-18	0	9.77	True