I have to admit that I’m a digital packrat. According to my backup service, I have 18.4 million files. Some of these are system files or installed applications that I won’t get rid of. The rest, however, are often unattributed or are one of a dozen files with the same name. Even more wasteful, many of these are identical copies of the same information.
I have a script that scans my drives regularly. It makes a list of the files on the system. The scans include the size of each file, its timestamp, the path to the file and its filename. It’s comprehensive and automated (and a little inefficient).
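A scan like this can be sketched in a few lines of Python. This is only an illustration, not the script described above: the function names `scan_tree` and `write_listing` and the tab-separated layout are assumptions chosen so that the listing stays easy to grep.

```python
import os
from datetime import datetime, timezone

def scan_tree(root):
    """Yield (size, timestamp, directory, filename) for every file entry under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                # Do not follow symlinks, so a link is recorded as itself.
                st = os.stat(full, follow_symlinks=False)
            except OSError:
                continue  # unreadable or vanished file; skip it
            stamp = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat()
            yield st.st_size, stamp, dirpath, name

def write_listing(root, out_path):
    """Write one tab-separated line per file and return the number of lines written."""
    count = 0
    with open(out_path, "w", encoding="utf-8", errors="surrogateescape") as out:
        for size, stamp, dirpath, name in scan_tree(root):
            out.write(f"{size}\t{stamp}\t{dirpath}\t{name}\n")
            count += 1
    return count
```

With one line per file, a search for a single file name is a plain text search over the listing.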
However, it is simple to search. I can ‘grep’ a file name and find where its copies are located. In a recent search-and-destroy mission, one file, ynew.exe, had 817 (!) copies. There were various builds of that file but many were exact copies of each other. Some of the files have been lurking since 2012. I didn’t realize how out-of-control things had become until I started developing a “de-replicate” tool.
Back to the topic of provenance, ynew is easy for me to identify since I wrote it myself. On the other hand, my font files are unattributed. When I move from one OS to another, I’ve usually copied the font files and installed them on the new system. Again, some of them are easy to recognize. “Bills-Font-5.ttf” came from a font generator service that took my handwriting and turned it into a font. I’m not sure of the origin of others. Are they available on Google Fonts or only from a sketchy site that offers free font downloads? I don’t know whether I’ve already met the license requirements for using them in a publication or commercially.
The metadata for many files is lost except for their timestamps and sizes. The timestamp can be a hint, and the size is a shortcut to identify obviously different files.
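The size shortcut works because two files of different sizes cannot be identical, so only files that share a size need a closer look. Here is a minimal sketch of that idea in Python. It is not the “de-replicate” tool itself; the names `file_digest` and `find_duplicates` and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    """Return lists of paths whose contents hash identically."""
    # Step 1: group by size. A file with a unique size cannot have an exact copy.
    by_size = defaultdict(list)
    for p in paths:
        try:
            by_size[os.path.getsize(p)].append(p)
        except OSError:
            continue  # missing or unreadable; skip it
    # Step 2: hash only the files that share a size with at least one other file.
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_hash = defaultdict(list)
        for p in same_size:
            try:
                by_hash[file_digest(p)].append(p)
            except OSError:
                continue
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Hashing is the expensive step, so skipping every file with a unique size can save a lot of reading on a large collection.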
PDFs can also be mysterious. Where did I get this research paper on hypernatremia? And why? Who wrote this essay on creativity and when? Most of the mp3s that I enjoy are labeled (and replicated). Most have album and artist in the path and in the audio’s metadata. Other audio files aren’t as organized. The provenance of these files can be a mystery. Although music files usually have metadata inside them, I don’t know how to automatically evaluate such a massive set of files. Even images offer a jumbled mess of information. Photoshop and Lightroom just make it worse.
Trying to get ahead of the problem, I’m making metadata files to surround the poetry that I have written. When I post it to Patreon, I also build metadata files to tie together drafts, links to the posts and details about the images I’ve used in the posts.
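One way such a metadata file could look is a small JSON “sidecar” written next to each poem. The sketch below is hypothetical: the function name `write_sidecar`, the `.meta.json` suffix and the field names are all assumptions for illustration, not the format actually in use.

```python
import json
from pathlib import Path

def write_sidecar(poem_path, drafts, post_urls, images):
    """Write <poem filename>.meta.json beside the poem and return the sidecar path."""
    poem = Path(poem_path)
    sidecar = poem.with_name(poem.name + ".meta.json")
    record = {
        "file": poem.name,           # the finished poem this record describes
        "drafts": list(drafts),      # earlier versions, oldest first
        "posts": list(post_urls),    # links to where the poem was published
        "images": list(images),      # notes on each image used in the posts
    }
    sidecar.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return sidecar
```

Because the sidecar is plain text that sits beside the poem, it travels with the file when the folder is copied and shows up in the same searches as everything else.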
Provenance is a useful concept for museums and other cultural institutions. Their records of provenance provide a chain of documentation connecting the original creator with the current item.
With technology, that chain can become tenuous. I would like to simplify the chaos on my system, but I haven’t won the war and have only fought a few tentative skirmishes. I hope it’s worth it.