Evaluating link rot
After reading Brian Suda’s article on link rot, I ran his script on my Pinboard collection and discovered that around 12% of my bookmarked links are invalid.
Here’s the data used to plot the above graph. The total number of bookmarked links varies from year to year, peaking in 2010 and 2011.
Year | Successful Bookmarks | Total Bookmarks | Success Rate (%) |
---|---|---|---|
2006 | 80 | 109 | 73.39 |
2007 | 132 | 163 | 80.98 |
2008 | 185 | 237 | 78.05 |
2009 | 268 | 319 | 84.01 |
2010 | 757 | 855 | 88.53 |
2011 | 794 | 890 | 89.21 |
2012 | 202 | 230 | 87.82 |
2013 | 143 | 158 | 90.50 |
2014 | 25 | 26 | 96.15 |
2018 | 51 | 59 | 86.44 |
2019 | 56 | 59 | 94.91 |
2020 | 49 | 52 | 94.23 |
2021 | 37 | 37 | 100 |
2022 | 6 | 6 | 100 |
Totals | 2785 | 3200 | 88.87 |
Note: I didn’t use Pinboard between 2014 and 2017 😶
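As a sanity check on the Totals row, here is a minimal Python sketch that recomputes the success rate two ways from the table's counts. The function names are mine, purely for illustration. The 88.87 figure appears to be the unweighted mean of the yearly percentages; pooling all bookmarks (2785 out of 3200) gives a slightly lower rate of about 87.0%, because the busy years 2010 and 2011 weigh more.

```python
# (year, successful bookmarks, total bookmarks), copied from the table above.
ROWS = [
    (2006, 80, 109), (2007, 132, 163), (2008, 185, 237), (2009, 268, 319),
    (2010, 757, 855), (2011, 794, 890), (2012, 202, 230), (2013, 143, 158),
    (2014, 25, 26), (2018, 51, 59), (2019, 56, 59), (2020, 49, 52),
    (2021, 37, 37), (2022, 6, 6),
]


def pooled_rate(rows):
    """Success rate over all bookmarks: total successes / total bookmarks."""
    successes = sum(ok for _, ok, _ in rows)
    total = sum(count for _, _, count in rows)
    return 100 * successes / total


def mean_of_yearly_rates(rows):
    """Unweighted mean of the per-year rates: every year counts equally."""
    yearly = [100 * ok / count for _, ok, count in rows]
    return sum(yearly) / len(yearly)


if __name__ == "__main__":
    print(f"pooled rate:          {pooled_rate(ROWS):.2f}%")
    print(f"mean of yearly rates: {mean_of_yearly_rates(ROWS):.2f}%")
```

Neither number is wrong; they answer slightly different questions (share of all links versus the typical year).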
Brian’s script works like this:
The code looks through your bookmarks and attempts to fetch each URL. If the HTTP code is less than 400 we mark it as a success. Without manually checking every URL, there might be some false positives: people selling existing domains, hosting provider redirects, etc. If the status code was 400 or higher, we marked it as a failure. After some manual investigation, we realized that some domains were not allowing bots to crawl them. Our code was using cURL, which appears as a bot, so we faked a browser’s user-agent string and decreased our failure rate by ~4%.
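To make the rule concrete, here is a minimal Python sketch of the same idea. It is not Brian's script (his used cURL); the function names and the user-agent string are my own assumptions, and it only uses the standard library's `urllib`.

```python
import http.client
import urllib.error
import urllib.request

# Hypothetical browser-like User-Agent; any realistic browser string would do.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def is_success(status):
    """Apply the rule from the quote: HTTP codes below 400 count as a success."""
    return status is not None and status < 400


def fetch_status(url, timeout=10):
    """Return the final HTTP status code for url, or None if nothing answered."""
    request = urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
    try:
        # urlopen follows redirects, so this is the status of the final response.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses raise, but still carry a code
    except (OSError, ValueError, http.client.HTTPException):
        return None  # DNS failure, refused connection, timeout, malformed URL


def check_bookmarks(urls, fetch=fetch_status):
    """Map each URL to True (success) or False (failure)."""
    return {url: is_success(fetch(url)) for url in urls}
```

Calling `check_bookmarks(["https://example.com/"])` would perform a real request. The `fetch` parameter is there so the classification can be tried without touching the network. As the quote points out, a success here only means the server answered, not that the original page is still there.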
Pinboard aka del.icio.us
I started using del.icio.us back in 2006, when I discovered the service at “The Future of Web Apps London”, and somehow forgot about it between 2014 and 2018.
I converted my account last year, when Pinboard’s creator Maciej reached out to ask whether we, the original one-time-payment users, would consider switching to a subscription, to help him keep maintaining and developing the service and make a living from it.
I was surprised that the number of invalid links wasn’t higher, considering that a vast majority of the links on my blog are now invalid. There is probably a significant number of false positives among the 88.87% of valid bookmarks. Randomly clicking through old links turned up a fair number of them.
I still need to finish my link-checking script, which replaces invalid links with links to the Internet Archive’s Wayback Machine.
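The replacement step could look roughly like the Python sketch below. This is not my finished script, just the core idea, and the function names are placeholders. It assumes the Wayback Machine's `https://web.archive.org/web/<timestamp>/<url>` address form, where the timestamp may be partial (a year, for instance) and the service redirects to the closest snapshot it has.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url, timestamp):
    """Build a Wayback Machine address for url near the given timestamp.

    timestamp is a string of digits in YYYYMMDDhhmmss order; a prefix such
    as "2010" is assumed to be enough to land on the closest snapshot.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{url}"


def replace_dead_links(links, dead_urls, timestamp):
    """Return links with every URL found in dead_urls swapped for its archive address."""
    return [
        wayback_url(link, timestamp) if link in dead_urls else link
        for link in links
    ]


if __name__ == "__main__":
    links = ["http://example.com/old-post", "https://example.org/"]
    dead = {"http://example.com/old-post"}
    for link in replace_dead_links(links, dead, "2010"):
        print(link)
```

Using the year the link was bookmarked as the timestamp seems like a reasonable default, since the snapshot closest to that date is the most likely to match what I originally saved. Whether a snapshot exists at all still has to be checked separately.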