The work presented here is the result of a small personal project developed to explore the impact of collections images on Wikimedia/Wikipedia, using the example of the Imperial War Museums collections. I’ll start with the headline stats that this work has produced, and then you can read about the approach I took. Stats are generally linked to source, or are taken from the summary here, where you can also explore all pages and images.
Key stats
Looking first at the overall number of images …
The Wikimedia Commons category “Collections_of_the_Imperial_War_Museum” and its sub-collections contain 58,824 files
One caveat is that a small number are user-uploaded images of objects that users have personally seen on display, but the majority are images directly from the collections that have been bulk uploaded. Wikimedia Commons pages get relatively few views, so how many of these are actually used on Wikipedia pages?
6,739 distinct IWM collection images appear on 36,725 en.wikipedia.org pages. These pages had a total of 61.4 million views in March 2018
Note that this is just English-language Wikipedia. The stats suggest that the total number globally is closer to 100,000 pages and 100 million page views. But what about those images that are just thumbnails on multiple pages, things you couldn’t really describe as having impact?
The most widely used IWM collection image, Heinkel_He_111_during_the_Battle_of_Britain.jpg, appears on 9,512 en.wikipedia.org pages which collectively had 26.2 million views in March 2018
This is where deeper analysis is needed. It’s just a thumbnail on 9,505 of those pages, used to represent the Second World War, meaning it’s only a ‘content’ image on 7 pages. What if we just look at large images in the main content of pages?
Large (non-thumbnail) IWM collection images appear 10,389 times across 6,616 en.wikipedia.org pages. These pages had 41.7 million views in March 2018
We can see that’s far fewer pages, but still a lot of images, and a lot of views. Some images have been used a lot
The most widely used large image, Churchill_portrait_NYP_45063.jpg, appears on 108 pages. These pages had 2.8 million views in March 2018
What if we then narrow this down to just those images that are large and appear as the first image on any given page? In my mind this says these must be significant images in the context of the page and also pretty much guarantees they are ‘seen’ by anyone who visits. In the case above, that’s 14 of the 108 pages, but that’s just one example.
An IWM collection image is used as the first, large image on 2,546 pages on en.wikipedia.org. These pages had 2.99 million views in March 2018
We can see the figures are much lower but still, to me at least, they are mind-blowing, especially when considering that the 800,000 IWM collection objects get, in total, about 100,000 to 120,000 views per month. Let’s take an example of an image …
The image MunichAgreement.jpg appears as a large image on 12 Wikipedia pages, and is the first image on three of these. Those three pages had 64,864 views in March 2018 (the 12 pages had 337,055 combined). The IWM collections page for that image had nine views
To underline that this is not just a chance case …
In March 2018 the total number of en.wikipedia.org page views where an IWM collections image was the first, large image was 2.99 million, whilst the total views for the equivalent collections items on the IWM site was about 15,000*
*this is an estimate as those with view counts below 5 were not counted
Addendum 15/4/18: I should point out that in no way are any of these stats intended to belittle those from IWM itself, more to flag the relative scale of access to collections objects that is provided through Wikipedia. Indeed the IWM collections are in themselves very significant to the organisation’s online presence, seeing over 182,000 unique visitors in March 2018 (this means over 26% of visitors to the IWM website visited the online collections, contributing almost 40% to the total page views), figures I think most cultural heritage organisations would rightly be proud of. The objects represented on Wikimedia/Wikipedia represent just a fraction of the entire collections and don’t contain some of the material that is most popular there, which actually highlights how much more potential there may be.
Background
I’ve always been aware that a) there are a lot of cultural heritage images on Wikimedia Commons; b) they are used extensively across a lot of Wikipedia pages; and c) some of those pages get a lot of traffic. In Spring 2016, when I joined Imperial War Museums to work on their online collections, I did some back-of-an-envelope calculations across multiple platforms including Wikipedia, Twitter and Pinterest. For Wikipedia I found there were ca. 60,000 IWM images on Wikimedia Commons and of those, 5,240 appeared on 15,572 pages on English Wikipedia, which collectively achieved tens of millions of page views each month. Impressive numbers.
But I was equally aware that these headline figures can be a bit misleading, or at least hard to relate to without more context. It was easy to cast doubt on their validity when you took examples that made a large contribution to those totals but didn’t feel like they achieved much impact. Take the case of the image that was used as the thematic icon for the Second World War and appeared as a tiny thumbnail on thousands of pages, or the image of children drinking milk that appeared on the much-visited page for Margaret Thatcher, but way down towards the end of a very long page.
What I always wanted to do was to go beyond the available data to look at not just where these images are in use but how they are used, and then extend that to make comparisons with the levels of access achieved by the collection provider’s own website. As suggested above, I was conscious that the position of an image on a page and the size at which it is displayed would affect its impact. Equally, pages with multiple collections images, or images that are used across multiple pages, will achieve a greater impact. I am unaware of any analysis that has been done to aggregate and weight these figures, so this piece of work sets out to look at those issues and develop both a theoretical and technical framework that can be re-used across any collection.
Approach
Essentially I have taken the extensive usage and page view data available from a couple of notable Wikimedia Commons tools, namely Glamorous and BaGLAMa 2, then written some fairly simple code to assess the structure of each Wikipedia page that they appear on, collate and analyse data, and then bring in Google Analytics data available for the same period from IWM’s own site. It’s imprecise and no doubt could be much improved, but it’s a start. For those who want to explore more, see this extensive write-up on the work undertaken.
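As a rough illustration of what assessing the structure of a page could involve, here is a minimal sketch in Python. It is not the code used for this work: the class name, the sample HTML and the 100-pixel cut-off between thumbnail and large image are all assumptions made for the example, and real Wikipedia markup (with its thumbnail URL paths and templates) would need more careful handling.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Record every <img> tag in document order (hypothetical helper)."""

    def __init__(self, large_min_width=100):
        super().__init__()
        # Assumed cut-off: anything narrower counts as a thumbnail.
        self.large_min_width = large_min_width
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        try:
            width = int(attr.get("width") or 0)
        except ValueError:
            width = 0  # non-numeric width: treat as unknown/small
        src = attr.get("src") or ""
        self.images.append({
            "position": len(self.images) + 1,   # 1 = first image on the page
            "file": src.rsplit("/", 1)[-1],     # last path segment only
            "large": width >= self.large_min_width,
        })


def collect_images(html, large_min_width=100):
    """Return a list of dicts describing each image's position and size class."""
    parser = ImageCollector(large_min_width)
    parser.feed(html)
    return parser.images


# Invented sample markup, standing in for a fetched Wikipedia page.
SAMPLE_HTML = """
<div><img src="/icons/Flag_icon.png" width="20"></div>
<div><img src="/images/MunichAgreement.jpg" width="300"></div>
<p>Some article text.</p>
<img src="/images/Other.jpg" width="220">
"""

images = collect_images(SAMPLE_HTML)
for image in images:
    print(image)
```

The resulting list can then be filtered against the set of file names in the Commons category to find where each collection image sits on the page and whether it counts as large.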
Beyond the aggregate counts covered in the stats above I have also tried to look at a formula that takes into account the position on any page, together with whether the image appears as a large content image or just a thumbnail. This could be particularly useful for measuring the potential impact of any image that appears in multiple places and formats on many pages, or a page that contains multiple images from one collection (and the impact might then be considered to be higher than just the page views).
To illustrate this, the en.wikipedia.org page for Winston_Churchill had over 1.4 million views in March 2018 and of 71 total images, 15 are from the IWM collections, and 13 of these are large images, yet the highest position is actually fifth, with the first ‘infobox’ image on the page being from Library and Archives Canada on Flickr. If we apply to each image the formula views/sqrt(position), and additionally divide by 10 for any thumbnails, this means, for example, that the 25th image will have a count of one fifth of that of an image in the first position. Summing these calculated impact figures gives us a total of 4.65 million, which I argue is a more accurate reflection of the contribution IWM images make to this page. In contrast, the page for Elizabeth_II, which had 1.35 million views, has just a single IWM image and it’s the sixth image, so the calculated impact is 550,769. Clearly this is highly experimental, but it’s a starting point! The overall effect of this can be seen by displaying the full list of pages by page views and by impact.
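The formula can be sketched in a few lines of Python. The function and constant names here are invented for illustration; only the views/sqrt(position) rule and the divide-by-10 thumbnail penalty come from the description above.

```python
from math import sqrt

THUMBNAIL_PENALTY = 10  # thumbnails count for a tenth of a large image


def image_impact(page_views, position, is_thumbnail=False):
    """Impact of one image: views / sqrt(position), reduced for thumbnails."""
    if position < 1:
        raise ValueError("position is 1-based: the first image is 1")
    score = page_views / sqrt(position)
    return score / THUMBNAIL_PENALTY if is_thumbnail else score


def page_impact(page_views, images):
    """Sum the impact of every collection image on a page.

    `images` is an iterable of (position, is_thumbnail) pairs.
    """
    return sum(image_impact(page_views, pos, thumb) for pos, thumb in images)


# The 25th image scores one fifth of the first image on the same page.
first = image_impact(1000, 1)
twenty_fifth = image_impact(1000, 25)

# Elizabeth_II: one large IWM image in sixth position, using the rounded
# 1.35 million views quoted above rather than the exact count.
elizabeth = page_impact(1_350_000, [(6, False)])
print(first, twenty_fifth, round(elizabeth))
```

With the rounded view count the Elizabeth_II figure comes out at roughly 551,000, close to the 550,769 quoted above, which was presumably calculated from the exact number of views.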
Where next?
This work arose out of a desire to assess impact, and my approach was to look not just at raw figures for page views but also image size, position, and prevalence so that a more nuanced figure could be obtained to measure the impact of any given image or page. The problem I now face is how to test this – ranking by this ‘impact factor’ appears to do what I set out to achieve in that it pushes well used but lower/smaller images down the rankings whilst those images that are used as large images at the top of a page get more recognition. Equally pages with many images are shown to have more impact than those that have fewer, even if those images are higher on the page. All I can say is that it ‘feels right’ and is certainly better than raw data, but what actually is impact? These people aren’t subsequently interacting, by and large they aren’t visiting the IWM website, they aren’t buying anything. What they are doing however is seeing collections images used editorially, in the context of a given topic, at a scale that no individual institution could ever hope to achieve.
Any thoughts on where to take this next would be much appreciated!
Very interesting – thank you for this! Agree that weighting images by prominence makes a lot of sense in assessing their impact. I don’t know if anyone has more actual data, but I guess there would be a case for weighting lead images more highly as they will often be the only image anyone sees for an article (particularly on mobile where they become a header image).
I’d also observe that not all thumbnails are equal. Here, for instance, two of the thumbnails are presented in a gallery which has almost equal prominence to any other image, so dividing by 10 is overly harsh. By contrast the He-111s are often used at very tiny size as part of templates and probably should be penalised more.
Generally I think Wikipedia isn’t great at measuring its impact (see my thoughts on this here: https://meta.wikimedia.org/wiki/User:The_Land/Thinking_about_the_impact_of_the_Wikimedia_movement )
Hi Chris, thanks for the comments. I agree that this is quite a blunt way of doing it but it’s intended as a start. I had the same thoughts about lead images, which is in part why I have made a particular case for these in the headline stats, but I accept that’s not really reflected in the overall figures as yet. In contrast I’ve also since noticed, for example, that some images can be by default ‘hidden’ in expandable panels, so in themselves should no doubt be further penalised. The devil is always in the detail!
Thanks also for the link to that interesting post, and also adding a link on there to here. I wish I had discovered that beforehand, but it’s reassuring to see some convergence in thought!