Remove custom tags from SingleFile output

Or, how I cut 2 MB of Readwise scripts from each archived page.

I often talk about the preservation of our digital heritage. The internet (and the digital world in general) has the peculiar property of being both very easy to preserve and very fragile and ephemeral.

Web pages are among the most delicate parts of the digital world. We lose pages every day. As Matt Birchler found by experimenting with his collection this month, 25.9% of his saved links have disappeared in the last 15 years. A quarter of what we thought interesting is now gone.

Web pages are only preserved by archival websites such as Archive.org or people who manually save pages. Like I do.

SingleFile and the “Readwise Problem”

My tool of choice is SingleFile, a browser extension for Firefox, Safari, and Chrome that saves an entire page in a single standard HTML file. CSS is embedded. Scripts are embedded. Even images are Base64-encoded and embedded, so the result is a single HTML file that can be read by anything.

However, SingleFile has some unpredictable interactions with the Readwise and Grammarly extensions. These extensions inject extra data into the web page inside custom tags (namely, readwise-tooltip-container and grammarly-desktop-integration), and SingleFile archives that data along with the rest of the page.

This would not be a big problem, except that Readwise injects 2 MB of minified JavaScript, which inflates every archived page by 2 MB or more.

That’s how I noticed the problem. My archived files started to weigh 3 MB and more, even for simple plain-text snapshots.

The good news is that this can be fixed. The bad news is that it is not very straightforward.

Filtering HTML Tags in SingleFile

I wish there were a more user-friendly way to filter custom tags, but there is not. So, for now, you have to follow these instructions.

First, we need to enable a hidden setting in SingleFile.

  1. Go to the SingleFile extension settings and export your profile.
  2. This will save a JSON file. Open it with any text editor, look for the "userScriptEnabled": false line, and change the value to true ("userScriptEnabled": true).
  3. Finally, re-import this file into SingleFile.
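
If you prefer not to edit the file by hand, the same change can be scripted. The following is only a sketch: the function name enableUserScripts and the sample object are made up for illustration, and the real exported file contains many more settings. Since I am not assuming anything about the exact layout of the export, the function simply walks the whole JSON tree and flips every "userScriptEnabled" key it finds.

```javascript
// Return a copy of the settings with every "userScriptEnabled" key set to true.
// The input object is left untouched.
function enableUserScripts(value) {
  if (Array.isArray(value)) {
    return value.map(enableUserScripts);
  }
  if (value !== null && typeof value === "object") {
    const result = {};
    for (const [key, child] of Object.entries(value)) {
      result[key] = key === "userScriptEnabled" ? true : enableUserScripts(child);
    }
    return result;
  }
  // Strings, numbers, booleans, and null are copied as they are.
  return value;
}

// Hypothetical, heavily trimmed stand-in for an exported settings file.
const exported = JSON.parse(
  '{"profiles":{"default":{"userScriptEnabled":false,"someOtherOption":true}}}'
);

const patched = enableUserScripts(exported);
console.log(JSON.stringify(patched, null, 2));
```

With Node.js, you could load the real export with fs.readFileSync and JSON.parse, run it through the function, and write the result back with fs.writeFileSync before re-importing it.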

Now, we need to implement the actual filtering script. For that, we will use Violentmonkey (the extension is also available for Chrome).

  1. Install Violentmonkey (or an equivalent extension).
  2. Create a new script.
  3. This will open an editor window. In it, paste the following code:
// ==UserScript==
// @name         Exclude Readwise/Grammarly
// @namespace    https://github.com/gildas-lormeau/SingleFile
// @version      1.0
// @description  [SingleFile] Exclude Readwise/Grammarly
// @author       Davide Aversa
// @match        *://*/*
// @grant        none
// ==/UserScript==

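// CSS selector matching every element to strip from the capture.
const removedElementsSelector = "readwise-tooltip-container, grammarly-desktop-integration";
// Let SingleFile know that a user script is listening for its events.
dispatchEvent(new CustomEvent("single-file-user-script-init"));
// Just before SingleFile captures the page, remove the matching elements.
addEventListener("single-file-on-before-capture-request", () => {
  document.querySelectorAll(removedElementsSelector).forEach(element => element.remove());
});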
  4. In removedElementsSelector, list all the HTML tags you want to filter out. Any valid CSS selector works, so you can also filter by ID or class.
  5. Save and close. The script should now be active and running on your pages.
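
If the list of things to filter grows, a single long string gets hard to read. One possible way to keep it tidy is to build the selector from an array. This is just a sketch: buildRemovalSelector is a helper name I made up, and #cookie-banner and .newsletter-popup are hypothetical examples of an ID and a class, not something Readwise or Grammarly actually injects.

```javascript
// Join individual CSS selectors into one comma-separated selector list,
// skipping empty entries.
function buildRemovalSelector(selectors) {
  return selectors
    .map(selector => selector.trim())
    .filter(selector => selector.length > 0)
    .join(", ");
}

const removedElementsSelector = buildRemovalSelector([
  "readwise-tooltip-container",     // custom tag from the original script
  "grammarly-desktop-integration",  // custom tag from the original script
  "#cookie-banner",                 // hypothetical ID
  ".newsletter-popup"               // hypothetical class
]);

console.log(removedElementsSelector);
```

The resulting string can be passed to document.querySelectorAll exactly like the hand-written one in the script.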

What does this script do? In short, it registers a filtering function on the page that runs every time the page receives the single-file-on-before-capture-request event, which is sent by the SingleFile extension.
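
To make the mechanism concrete, here is a toy simulation of that flow that runs outside the browser (for example in Node.js). Everything in it is a stand-in: page is a plain EventTarget playing the role of the window, elements is an array of tag names playing the role of the DOM, and the final dispatchEvent call plays the role of SingleFile. It is not how SingleFile works internally, only an illustration of "register a listener, then react when the event arrives".

```javascript
// Stand-in for the browser window: anything that can receive events.
const page = new EventTarget();

// Stand-in for the DOM: just a list of tag names.
let elements = [
  "p",
  "readwise-tooltip-container",
  "article",
  "grammarly-desktop-integration"
];

// Tags we want to drop before the page is captured.
const removedTags = new Set([
  "readwise-tooltip-container",
  "grammarly-desktop-integration"
]);

// The user script's job: register the filter. Nothing is removed yet.
page.addEventListener("single-file-on-before-capture-request", () => {
  elements = elements.filter(tag => !removedTags.has(tag));
});

// Stand-in for SingleFile announcing that a capture is about to start.
page.dispatchEvent(new Event("single-file-on-before-capture-request"));

console.log(elements);
```

The key point is that the filter does nothing until the event is dispatched, so the injected tags stay on the page during normal browsing and only disappear when a capture starts.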

At this point, you can save your page as usual and the custom tags should be removed from the final output. :)

Conclusion

I hope you find this small how-to helpful!

See you next time!
