This analysis explores whether the presence of a temporary exhibit has any impact on History Center attendance. The question is difficult to answer directly and holistically, because there are very few times when the History Center has no temporary exhibit. Those periods are also very short and don't fall at the same time each year, which makes it difficult to separate seasonal attendance fluctuations from exhibit effects.
However, some trends can be observed which demonstrate that exhibits do have an effect on attendance, though that effect varies wildly depending on the exhibit and where the exhibit falls in the calendar year.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from datetime import datetime as dt
raw = pd.read_csv('hc_ga.csv')
exh_sched = pd.read_csv('exhibit_sched.csv')
This table shows a glimpse of the raw data, which is used to produce all the plots in the rest of this notebook. The data was pulled from Tessitura, and thus contains History Center general attendance data from that system only.
raw.head()
This table contains the start and end dates of each exhibit, going back to 2014. Because this analysis focuses on Tessitura data, the only relevant exhibits are those that ended in 2016 or later.
exh_sched[exh_sched['end_year'] >= 2016]
cy161718 = raw.loc[raw['year'].isin([2016,2017,2018])]
g = sns.relplot(x='month', y='sli_no', data=cy161718.groupby(['month', 'year']).count().reset_index(), kind='line',
facet_kws={'subplot_kws': {'title': 'Monthly History Center Attendance (Aggregated Years)'}})
g.set_axis_labels('Month', 'Tickets')
This plot shows the average monthly attendance across 2016, 2017, and 2018 as the dark blue line. The lighter blue shading is the confidence band seaborn draws around that average, so wider shading means more year-to-year variation for that month. The plot clearly shows the relationship between time of year and History Center attendance, with peaks in March for spring break and in the summer. Bear in mind that this data does not include school groups, which is actually a boon for our purposes.
Note that the summer months have very little variance, suggesting that summer visitation for the History Center is consistently the highest of the year, and that that visitation is pretty consistently the same level over the years. This means that, during the summer, the seasonal increases in attendance could override all other factors that impact attendance. Likewise, there is much more variation in the spring and fall, suggesting that other factors are accentuated by the seasonal lull in attendance.
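One way to put a number on "very little variance" is the coefficient of variation: the spread of a month's totals across years divided by their mean. The sketch below is only an illustration. The counts are made-up placeholders rather than Tessitura data, and `coefficient_of_variation` is a hypothetical helper name.

```python
from statistics import mean, pstdev

# Hypothetical monthly ticket counts keyed by (year, month); not real data.
monthly_tickets = {
    (2016, 3): 9000, (2017, 3): 12000, (2018, 3): 6000,
    (2016, 7): 11000, (2017, 7): 11000, (2018, 7): 11000,
}

def coefficient_of_variation(counts, month):
    """Spread of one month's counts across years, relative to their mean."""
    values = [n for (_, m), n in counts.items() if m == month]
    return pstdev(values) / mean(values)

march_cv = coefficient_of_variation(monthly_tickets, 3)
july_cv = coefficient_of_variation(monthly_tickets, 7)
print(f"March CV: {march_cv:.2f}, July CV: {july_cv:.2f}")
```

A value near zero means the month looks the same every year. With only three years per month this is a rough indicator, not a formal test.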
Now let's break out the previous plot by year. Instead of seeing the fluctuations as a shaded area, we'll see a line for each year in the data available. This means we can include 2019, which obviously bottoms out pretty quickly because it is still a year in progress.
g = sns.relplot(x='month', y='sli_no', data=raw.groupby(['month', 'year']).count().reset_index(), hue='year', kind='line',
facet_kws={'subplot_kws': {'title': 'Monthly History Center Attendance'}})
plt.axvline(9)  # ymin/ymax are axes fractions (0 to 1); the defaults span the full height
g.set_axis_labels('Month', 'Tickets')
I've placed a line on September because this plot gives us the first clue toward the question of whether or not exhibits affect attendance. You'll note that attendance at the History Center dips in September each year, but that in 2016 it dipped much more than in 2017 and 2018. Gridiron Glory, an exhibit that did not meet our attendance expectations, premiered in September of 2016.
exh_sched[exh_sched['exhibit'] == 'Gridiron Glory']
Therefore, already we can say that if a temporary exhibit at the History Center underperforms, it may have a negative impact on History Center attendance beyond the impact of seasonal fluctuations in attendance.
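To make "dipped much more" concrete, one option is to express a year's monthly total as a fractional difference from the average of the other years for that same month. This is a minimal sketch under assumptions: the September counts are invented placeholders, and `deviation_from_other_years` is a hypothetical helper, not part of the notebook.

```python
from statistics import mean

# Hypothetical September ticket counts by year; not real data.
september_tickets = {2016: 4000, 2017: 6000, 2018: 6000}

def deviation_from_other_years(counts_by_year, year):
    """Fractional difference between one year and the mean of the others."""
    others = [n for y, n in counts_by_year.items() if y != year]
    baseline = mean(others)
    return (counts_by_year[year] - baseline) / baseline

sept_2016_gap = deviation_from_other_years(september_tickets, 2016)
print(f"September 2016 vs. other years: {sept_2016_gap:+.0%}")
```

Comparing within the same month keeps the seasonal dip out of the number, so what remains is the part that differs from year to year.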
Next, let's break down the data over all the months of the year and see if we can spot any other interesting dips or spikes if we look at day-over-day attendance.
g = sns.FacetGrid(raw.groupby(['day', 'month', 'year']).count().reset_index(), col='month', col_wrap=4, hue='year')
g.map(plt.plot, 'day', 'sli_no')
g.set_axis_labels('Day', 'Tickets')
g.add_legend()
The above graphs could potentially indicate a few times where an exhibit had an impact on attendance. I'm particularly interested in spikes that obviously differ from that same day in the previous year, because that could be an indicator of exhibit impact. Year-to-year variations in attendance could be explained by changes in exhibits, especially if those variations are not also explained by shifts in the calendar (i.e. weekdays and holidays moving around). Let's dive into some Noteworthy Spikes.
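Because the plots compare the same day of the month across years, a Saturday in one year can end up next to a Tuesday in another. One hedged way to check whether a spike is just a weekday shift is to compare against the date exactly 52 weeks earlier, which always falls on the same weekday. The helper name below is hypothetical and the date is only an example.

```python
from datetime import date, timedelta

def same_weekday_last_year(day):
    """Return the date exactly 52 weeks earlier, which shares the weekday."""
    return day - timedelta(weeks=52)

spike = date(2019, 1, 21)  # example date only
comparable = same_weekday_last_year(spike)
print(spike.strftime("%A %Y-%m-%d"), "->", comparable.strftime("%A %Y-%m-%d"))
```

This lines up weekdays but not holidays, which move by date from year to year and need to be matched separately.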
Just going chronologically, in the month = 1 (January) plot there is a very large spike in 2019. Let's take a closer look at that day specifically.
g = sns.relplot(x='day', y='sli_no',
data=raw[raw['month']==1].groupby(['day', 'month', 'year']).count().reset_index(), hue='year', kind='line',
facet_kws={'subplot_kws': {'title': 'History Center Attendance (January)'}})
g.set_axis_labels('Day', 'Tickets')
plt.axvline(21)  # mark January 21; default ymin/ymax (axes fractions) span the full height
So, the spike is right around the 20th/21st of January 2019. That was MLK weekend that year, which could definitely explain the spike.
g = sns.relplot(x='day', y='sli_no',
data=raw[raw['month']==1].groupby(['day', 'month', 'year']).count().reset_index(),
hue='year', kind='line',
facet_kws={'subplot_kws': {'title': 'History Center Attendance (January)'}})
g.set_axis_labels('Day', 'Tickets')
plt.axvline(15)  # mark January 15; default ymin/ymax (axes fractions) span the full height
Now, in 2018, MLK day was on the 15th, and there's a spike for that year on that day as well, but it's not nearly as high as the spike in 2019. What's the difference?
exh_sched[exh_sched['exhibit'] == '1968 exhibit']
Interesting! The 1968 Exhibit closed on 1/21/2019, which coincided with MLK day and resulted in double the attendance compared to MLK day 2018. This suggests that popular exhibits do have an impact on attendance above and beyond the expected lift from seasonal fluctuations.
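The MLK Day dates used in this comparison can be derived rather than looked up, since the holiday is observed on the third Monday of January. The sketch below uses only the standard library; `nth_weekday` and `mlk_day` are hypothetical helper names introduced for illustration.

```python
from datetime import date, timedelta

def nth_weekday(year, month, weekday, n):
    """Date of the n-th given weekday (Monday=0) in a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))

def mlk_day(year):
    """MLK Day is observed on the third Monday of January."""
    return nth_weekday(year, 1, 0, 3)

for year in (2017, 2018, 2019):
    print(year, mlk_day(year))
```

Aligning each year on its own holiday date, instead of on the same day of the month, makes the 2018 and 2019 spikes directly comparable.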
Next, let's look at month = 3 (March). This one caught my eye despite not being as large of a spike because it sits in a trough of low attendance for the other two years in March.
g = sns.relplot(x='day', y='sli_no',
data=raw[raw['month']==3].groupby(['day', 'month', 'year']).count().reset_index(),
hue='year', kind='line',
facet_kws={'subplot_kws': {'title': 'History Center Attendance (March)'}})
g.set_axis_labels('Day', 'Tickets')
plt.axvline(20)  # mark March 20; default ymin/ymax (axes fractions) span the full height
Looks like March is a pretty volatile month. That could likely be because of variations in Saint Paul and Minneapolis spring break schedules, as well as possibly a consequence of unpredictable March weather. The largest spike (in 2016) could also potentially be a consequence of Suburbia closing on the 20th.
exh_sched[exh_sched['exhibit'] == 'Suburbia']
Interesting, the peak of the spike is the day before the 20th. The 19th was a Saturday in March 2016. Perhaps closing on the 19th instead of the 20th would have driven even more people to attend that day? Or possibly Sunday would have cratered instead of staying relatively higher than normal.
For June, we have another massive spike, this time on 6/23/2018.
g = sns.relplot(x='day', y='sli_no',
data=raw[raw['month']==6].groupby(['day', 'month', 'year']).count().reset_index(),
hue='year', kind='line',
facet_kws={'subplot_kws': {'title': 'History Center Attendance (June)'}})
g.set_axis_labels('Day', 'Tickets')
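plt.axvline(23)  # mark June 23; default ymin/ymax (axes fractions) span the full height
plt.axhline(225, color='r')  # median daily tickets before the 23rd (June 2018)
plt.axhline(344.5, color='g')  # median daily tickets after the 23rd (June 2018)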
exh_sched[exh_sched['exhibit'] == 'Somalis + Minnesota']
june = raw[(raw['month']==6) & (raw['year']==2018)]
before = june[june['day'] <= 22].groupby(['day']).count()
after = june[june['day'] >= 24].groupby(['day']).count()
print('Median before Somalis + Minnesota = ' + str(before.median()['sli_no']))
print('Median after Somalis + Minnesota = ' + str(after.median()['sli_no']))
The spike on 6/23/2018 coincides exactly with the opening of Somalis + Minnesota. Interestingly, this also gives us a first hint of a before/after effect of an exhibit opening. As you can see, the median attendance for June 2018 is lower before the 23rd (red line) than it is after the 23rd (green line). However, June 2018 attendance is pretty close to the previous two years both before and after the 23rd, so I'm not willing to say definitively that Somalis + Minnesota increased attendance during its entire run. That is especially true because, when looking at the data for the entire year, June typically rises toward the stable July/August level of attendance over the course of the month anyway. Still, Somalis + Minnesota certainly caused a very large spike when it opened, and may have had an effect on attendance afterward.
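Since June rises through the month in any year, one way to separate the exhibit opening from that normal climb is to subtract the before/after change seen in a year without an opening on that date. This is a rough sketch, not a definitive method. The daily counts are invented placeholders, and `before_after_medians` is a hypothetical helper that, like the cell above, leaves the opening day itself out of both windows.

```python
from statistics import median

def before_after_medians(daily_counts, event_day):
    """Median tickets strictly before and strictly after event_day."""
    before = [n for d, n in daily_counts.items() if d < event_day]
    after = [n for d, n in daily_counts.items() if d > event_day]
    return median(before), median(after)

# Hypothetical day-of-month -> ticket counts for two Junes; not real data.
june_event_year = {20: 200, 21: 220, 22: 240, 23: 900, 24: 330, 25: 350, 26: 370}
june_prior_year = {20: 210, 21: 220, 22: 230, 23: 250, 24: 300, 25: 310, 26: 320}

b1, a1 = before_after_medians(june_event_year, 23)
b0, a0 = before_after_medians(june_prior_year, 23)
# Change in the event year minus the change the prior year showed anyway.
excess_change = (a1 - b1) - (a0 - b0)
print(excess_change)
```

If the excess change is close to zero, the higher median after the opening is mostly the usual June climb rather than an exhibit effect.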
November has a number of interesting spikes. For some reason in 2018, the spike around Thanksgiving was particularly large, nearly double the previous two years. I'm not sure why that would be! I couldn't find an immediately obvious explanation, but perhaps there was some sort of programming or advertising that year that attracted attention to the History Center?
g = sns.relplot(x='day', y='sli_no',
data=raw[raw['month']==11].groupby(['day', 'month', 'year']).count().reset_index(),
hue='year', kind='line',
facet_kws={'subplot_kws': {'title': 'History Center Attendance (November)'}})
g.set_axis_labels('Day', 'Tickets')
plt.axvline(4)  # mark November 4; default ymin/ymax (axes fractions) span the full height
plt.axvline(11)  # mark November 11
plt.axvline(24)  # mark November 24
exh_sched[exh_sched['exhibit'] == 'WWI America']
Either of the other two spikes could potentially be explained by the closure of WWI America, which closed some time in November of 2017. Unfortunately, I was unable to find an exact closure date for that exhibit on our site, so I can't say for sure whether these spikes are the result of exhibit influence. Even if one of them is the closure date, however, there are two spikes of almost exactly the same magnitude. Perhaps in the lead-up to WWI America closing, there were programs that drew people to the History Center and they decided to stay for general admission?
December is noteworthy primarily because of its regularity. Most of the month is about the same year-over-year, and then attendance rises sharply as the month concludes and holiday vacation time kicks in. Of particular interest is the fact that 2017 saw a larger than usual spike on the 26th.
g = sns.relplot(x='day', y='sli_no',
data=raw[raw['month']==12].groupby(['day', 'month', 'year']).count().reset_index(),
hue='year', kind='line',
facet_kws={'subplot_kws': {'title': 'History Center Attendance (December)'}})
g.set_axis_labels('Day', 'Tickets')
plt.axvline(25)  # mark December 25; default ymin/ymax (axes fractions) span the full height
exh_sched[exh_sched['exhibit'] == '1968 exhibit']
It doesn't line up as neatly as the Somalis + Minnesota exhibit opening causing an enormous spike exactly on the day it opened, but 1968 opened on 12/23/2017, which could explain the unusual spike on the 26th. Perhaps the Christmas holiday had a delaying effect on the attendance spike of 1968's opening, prompting celebrants to wait until after their festivities were over before going to see the exhibit?
This section highlights some things I thought were interesting or noteworthy, including spikes I couldn't explain and also a distinct lack of spikes that could potentially be telling.
g = sns.FacetGrid(raw.loc[raw['month'].isin([7,8])].groupby(['day', 'month', 'year']).count().reset_index(), col='month', col_wrap=4, hue='year')
g.map(plt.plot, 'day', 'sli_no')
g.add_legend()
g.set_axis_labels('Day', 'Tickets')
I found July and August's extremely consistent behavior over three years to be very interesting. No huge spikes at all, which serves to highlight the impact of the noteworthy spikes I outlined earlier. If high attendance months like July and August are so consistent year-over-year, then that could indicate that the spikes seen around exhibit openings and closings elsewhere in the year are indeed related to the exhibits and not random chance.
g = sns.FacetGrid(raw.loc[raw['month'].isin([4,5,9,10])].groupby(['day', 'month', 'year']).count().reset_index(), col='month', col_wrap=4, hue='year')
g.map(plt.plot, 'day', 'sli_no')
g.add_legend()
g.set_axis_labels('Day', 'Tickets')
I also wanted to address several large spikes that I have no explanation for.
April has two large spikes, neither of which corresponds to an exhibit opening, unless WWI's opening on 4/8/17 somehow impacted attendance an entire year later.
May has another very large spike around the 20th, which I also have no explanation for. It's not Memorial Day weekend (the effects of that can be seen at the end of the month), and I wasn't able to find an exhibit that could have affected that day. Perhaps a program drew people in and they came to the museum as well?
Finally, both September and October have spikes over two years (only sharing one year, 2017) around the 20th of the month. I have no explanation for this either. It's certainly not Labor Day, and it's not school trips, since those are not included in this data. An enticing mystery!
This analysis provides evidence that exhibits do indeed have an impact on attendance at the History Center. When an underperforming exhibit is present, the History Center's overall attendance suffers more than would be expected based on the time of year. When a high performance exhibit opens or closes, it has an easily noticeable impact on daily attendance that goes above and beyond the effects of seasonal rhythms.
What is not clear from this analysis is how much of an impact the ongoing presence of an exhibit has on attendance at the History Center. I feel fairly confident saying 1968 had an impact on attendance when it opened and when it closed, but did it result in higher than normal attendance at the History Center throughout its tenure? This isn't as clear to me. I could argue either way, just comparing to the previous two years of month-over-month data, but I can draw no definitive conclusions at this time. This suggests avenues for future analysis. If we can ever isolate "non-exhibit" History Center attendance, then we could more definitively say whether an exhibit has a continuing effect on attendance.
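A first step toward isolating "non-exhibit" attendance would be to list the days on which no temporary exhibit was open, so those days can be compared with exhibit days from the same season. The sketch below is an assumption-laden illustration. The 1968 dates are the opening and closing dates mentioned earlier in this notebook, the second schedule entry is a made-up placeholder, and both function names are hypothetical.

```python
from datetime import date, timedelta

# (name, opening date, closing date). The 1968 dates come from the text above;
# the second entry is a made-up placeholder, not a real exhibit.
schedule = [
    ("1968 exhibit", date(2017, 12, 23), date(2019, 1, 21)),
    ("Placeholder exhibit", date(2019, 2, 2), date(2019, 6, 30)),
]

def exhibits_open_on(day, schedule):
    """Names of exhibits whose run includes the given day."""
    return [name for name, start, end in schedule if start <= day <= end]

def days_without_exhibit(first, last, schedule):
    """Every date from first to last (inclusive) with no exhibit open."""
    day, gaps = first, []
    while day <= last:
        if not exhibits_open_on(day, schedule):
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

gaps = days_without_exhibit(date(2019, 1, 1), date(2019, 2, 28), schedule)
print(len(gaps), gaps[0], gaps[-1])
```

In the notebook itself the schedule would come from `exh_sched` rather than a hand-written list, and the gap days would likely be few and short, which is the limitation noted at the top of this analysis.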
Finally, I believe this analysis may indicate the potential of scheduling exhibit openings and closings near or even on holidays and vacation times. If 1968 drew unusual crowds on December 26th, and the History Center sees lift on Thanksgiving, Memorial Day, and MLK Day (which was augmented by 1968 closing), then this may suggest an opportunity: rather than avoiding "conflict" with holidays and vacations, we could deliberately schedule around them. Conventional wisdom certainly suggests that Twin Cities residents leave the area on days like these, but the data suggests that these days still result in higher than normal attendance for that month. Perhaps tailoring exhibit openings to residents who stay in town for those days, and who visit the History Center, could result in attendance lift?