Recently, I was working with an investment management company who had an interesting problem to solve.

They had 10 years' of historical data on approximately 1500 companies. This was stored in a spreadsheet and, each month, for 10 years, they would copy the current sheet of companies, record any changes and add any new companies to the bottom. This spreadsheet had 120 worksheets - one for each month for the last 10 years - and each sheet had a at least 1000 rows - one for each company of interest at the time.

The challenge was that they needed to know when and how various attributes had changed for those companies. The information was all there in the spreadsheets, but extracting it was not easy.

Fortunately, I had Python and Pandas at my disposal. Here's how I tackled the problem.

First, lets's set things up to mimic the spreadsheets. We'll create three pandas dataframes, one for each of three months, and store those in a Python dictionary with the month as the key. (This is exactly how importing an Excel workbook with mutliple sheets would work). For good measure, we'll also include some inconsistency in the month names because that exists in the real data.

The dataframes will list three fictitious companies with a single 'status' attribute which can be either 'active' or 'inactive.':

In [1]:
from datetime import datetime
from IPython.display import Markdown
import pandas as pd

months = ['May 2018', 'June2018', 'July 2018']
companies = ['ABC Inc.', 'XYZ Ltd.', 'Acme Corp.']

company_data = {
    month: pd.DataFrame({'company': companies, 'status': ['active' for i in range(len(companies))]})
    for month in months
}

for month in months[1:]:
    company_data[month]['status'][1] = 'inactive'
    
company_data[months[2]]['status'][2] = 'inactive'

Next, let's define a function to display the contents of our dictionary so we can see how that has worked:

In [2]:
def display_company_data(company_data):
    for month, df in company_data.items():
        display(Markdown(f'** {month} **'))
        display(df)

display_company_data(company_data)

May 2018

company status
0 ABC Inc. active
1 XYZ Ltd. active
2 Acme Corp. active

June2018

company status
0 ABC Inc. active
1 XYZ Ltd. inactive
2 Acme Corp. active

July 2018

company status
0 ABC Inc. active
1 XYZ Ltd. inactive
2 Acme Corp. inactive

We'll need to deal with the dictionary keys. Currently, they are strings representing a month and they are in two different formats. Let's define a function to convert those strings to a Python datetime object using a list of possble string formats:

In [4]:
def text_to_month(text, index=0):
    formats = ["%B %Y", "%B%Y"] 
    try:
        return datetime.strptime(text, formats[index])
    except ValueError:
        return text_to_month(text, index + 1)
    except IndexError:
        raise ValueError(f"No valid datetime format found for {text}")

Now, we can convert the month names to proper datetime values and view the result:

In [5]:
company_data = {
    text_to_month(key): value
    for key, value in company_data.items()
}

display_company_data(company_data)

2018-05-01 00:00:00

company status
0 ABC Inc. active
1 XYZ Ltd. active
2 Acme Corp. active

2018-06-01 00:00:00

company status
0 ABC Inc. active
1 XYZ Ltd. inactive
2 Acme Corp. active

2018-07-01 00:00:00

company status
0 ABC Inc. active
1 XYZ Ltd. inactive
2 Acme Corp. inactive

Let's use the company name as the row index instead of the row number:

In [6]:
for month, df in company_data.items():
    df.set_index('company', inplace=True)
    
display_company_data(company_data)

2018-05-01 00:00:00

status
company
ABC Inc. active
XYZ Ltd. active
Acme Corp. active

2018-06-01 00:00:00

status
company
ABC Inc. active
XYZ Ltd. inactive
Acme Corp. active

2018-07-01 00:00:00

status
company
ABC Inc. active
XYZ Ltd. inactive
Acme Corp. inactive

Now, we can concatenate the dataframes into one with the month as an additional index:

In [7]:
df = pd.concat(company_data)
display(df)
status
company
2018-05-01 ABC Inc. active
XYZ Ltd. active
Acme Corp. active
2018-06-01 ABC Inc. active
XYZ Ltd. inactive
Acme Corp. active
2018-07-01 ABC Inc. active
XYZ Ltd. inactive
Acme Corp. inactive

We can swap the order of the month and company indexes and then sort by that order:

In [8]:
df = df.swaplevel(0, 1, axis=0).sort_index()
display(df)
status
company
ABC Inc. 2018-05-01 active
2018-06-01 active
2018-07-01 active
Acme Corp. 2018-05-01 active
2018-06-01 active
2018-07-01 inactive
XYZ Ltd. 2018-05-01 active
2018-06-01 inactive
2018-07-01 inactive

Finally, we use the shift() and groupby() methods to filter only the rows where the status has changed for that company:

In [9]:
display(df[df.groupby(['company']).status.shift().ne(df.status)])
status
company
ABC Inc. 2018-05-01 active
Acme Corp. 2018-05-01 active
2018-07-01 inactive
XYZ Ltd. 2018-05-01 active
2018-06-01 inactive


Comments

comments powered by Disqus