Readeck Content Scripts
Introduction
Content Scripts are a powerful way to extend Readeck content extraction capabilities. These scripts are written in JavaScript (ES5) and executed while a link is saved by Readeck.
Your installation comes with a few default scripts and you can create your own.
Here are some use cases for Content Scripts:
- Extend or fix an extraction configuration,
- Retrieve more metadata through a website API,
- Transform content to a type (image or video),
- Extract a video transcript.
Getting started with Content Scripts
Before writing our first script, it's advise to edit your configuration file (config.toml
) in the main section:
[main]
dev_mode = true
log_level = "debug"
The dev_mode
will scan for content scripts each time you save a link (otherwise, you'll have to restart Readeck on each change). The change of log level will help you debuging your script.
Your first Content Script will be a JavaScript file located in the content-scripts
folder in your data directory. So let's first create a file content-scripts/test.js
to get our feet wet.
Add the following in the script:
exports.isActive = function() {
console.warn("---------------------", $.domain)
return true
}
This does two things: show the domain in the logs and activate the script. If all goes well, you should see something like this in your log while extracting a link:
WARN[0037] ---------------------theguardian.com @id=04251692f490/7sk2ggGlPX-000006 bookmark_id=11 script=data/content-scripts/test.js
Tip
We used a warning level to easily distinguish our script console.log
.
Let's try something a bit more real. We want every link save from theguardian.com
to be set as a photo and with the same title.
exports.isActive = function() {
// This script works only for this domain
return $.domain == "theguardian.com"
}
exports.processMeta = function() {
// Change the type, it's always a picture
$.type = "photo"
// Use a fixed title
$.title = "It's always the same title"
}
Now, if you save any link from theguardian.com
, they will all have the same title and a picture type. Maybe don't keep this script around when you're done with it.
Examples
Readeck has a few built-in Content Scripts that you can check-out as good starting points for changes you'd like to make. You'll find them all on the source code repository.
For small configuration change, you can have a look at the Configuration Content Script.
For a more involved Content Script, the Youtube Content Script can give you an idea of how far you can go.
Default site configuration
For many websites, Readeck embeds site configuration files. You can find the default files on the source code repository. These are JSON files that are loaded based on the link's domain name. There can be mistakes in these files (because a change was not tracked) and a Content Script is the best way to fix them.
Credit is due to Wallabag and FiveFilters for maintaining the original set of configurations.
Content Scripts API
The main content script API consists in exporting some functions that can perform operations on the current extracted information.
priority
exports.priority = 0
This is a integer value, defaulting to 0
when unset. The higher the number, the later the script will run. For a script overriding the site configuration with setConfig
, you'll need to set it to a value higher than 10
to ensure the script runs last.
isActive
exports.isActive()
This function must return a boolean to indicate that the script can run in the current context.
If the function is absent from the script, the other functions will never run.
// Always run
exports.isActive = function() {
return true
}
// Only run on a specific domain
exports.isActive = function() {
return $.domain == "youtube.com"
}
setConfig
exports.setConfig(config)
This function receives an SiteConfiguration object reference. It can set properties of the object as long as the value types don't change.
exports.setConfig = function(config) {
// Override TitleSelectors
config.titleSelectors = ["/html/head/title"]
// Append a body selector
config.bodySelectors.push("//main")
}
processMeta
exports.processMeta()
This function runs after loading the page meta data.
Global variables and functions
$
: extractor information
The global variable $
holds everything that's needed to read or change information on the current extraction process.
Note
When a property of $
is a list you can not use push()
to append new values. You must instead reassign the value with the existing list and anything you need to append.
$.domain
(read only)
The domain of the current extraction. Note that it's different from the host name. For example, if the host name is www.example.co.uk
, the value of $.domain
is example.co.uk
.
The value is always in its Unicode form regardless of the initial input.
$.hostname
(read only)
The host name of the current extraction.
The value is always in its Unicode form regardless of the initial input.
$.url
(read only)
The URL of the current extraction. The value is a string that you can parse with new URL($.url)
when needed.
$.meta
This variable is an object whose values are lists of strings. For example:
{
"html.title": ["document title"]
}
$.meta
. You can not use push()
to add new values.
When setting values, you can use a list or a single string.
$.meta["html.title"] = "new title" // valid
$.meta["html.author"] = ["someone", "someone else"] // valid
$.authors
A list of found authors in the document.
Note: When setting this value, it must be a list and you can not use $.authors.push()
to add new values.
$.description
A string with the document description.
$.title
A string with the document title.
$.type
The document type. When settings this value, it must be one of "article", "photo" or "video".
$.html
(write only)
When settings a string to this variable, the whole extracted content is replaced. This is an advanced option and should only be used for content that are not articles (photos or videos).
$.readability
Whether readability is enabled for this content. It can be useful to set it to false when setting an HTML content with $.html
.
Please note that even though readability can be disabled, it won't disable the last cleaning pass that removes unwanted tags and attributes.
unescapeURL
/**
* @param {string} value - input URL
* @return {string}
*/
function unescapeURL(value)
This function transforms an escaped URL to its non escaped version.
decodeXML
/**
* @param {string} input
* @return {Object}
*/
function decodeXML(input)
This function decodes an XML text into an object than can be serialized into JSON or filtered.
requests
If you need to perform HTTP requests in a content script, you must use the requests
global object.
This is by no means a full featured or advanced HTTP client but it will let you perform simple requests and retrieve JSON or text responses.
const rsp = requests.get("https://nativerest.net/echo/get")
rsp.raiseForStatus()
const data = rsp.json()
requests.get(url, [headers])
This function performs a GET HTTP request and returns a response object.
An optional header
object can take header values for the request.
requests.post(url, data, [headers])
This function performs a POST HTTP requests and returns a response object. The data
parameter must be a string of the data you want to send.
An optional header
object can take header values for the request.
const rsp = requests.post(
"http://example.net/",
JSON.stringify({"a": "abc"}),
{"Content-Type": "application/json"},
)
response
object
response.status
This is the numeric status code.
response.headers
This contains all the response's headers.
response.raiseForStatus()
This function will throw an error if the status is not 2xx.
response.json()
This function returns an object that's the serialization of the response's body.
response.text()
This function returns the response's text content.
Types
Site Configuration
The setConfig
function receives a config
object that can be modified.
config.titleSelectors
- []string
XPath selectors for the document title.
config.bodySelectors
- []string
XPath selectors for the document body.
config.dateSelectors
- []string
XPath selectors for the document date.
config.authorSelectors
- []string
XPath selectors for the document authors.
config.stripSelectors
- []string
XPath selectors of elements that must be removed.
config.stripIdOrClass
- []string
List of IDs or classes that belong to elements that must be removed.
config.stripImageSrc
- []string
List of strings that, when present in an src
attribute of an image will trigger the element removal.
config.singlePageLinkSelectors
- []string
XPath selectors of elements whose href
attribute refers to a link to the full document.
config.nextPageLinkSelectors
- []string
XPath selectors of elements whose href
attribute refers to a link to the next page.
config.replaceStrings
- [][2]string
List of pairs of string replacement.
config.httpHeaders
- object
An object that contain HTTP headers being sent to every subsequent requests.