Copyright © 2022-2023 Phil Brooke & Green Pike Ltd. Released under GPL v3; see the Copyright section.
Outline of document processing
Write slightly enhanced Pandoc Markdown into a file such as input/whatever.txt. As with Markdown generally, the intention is that the source files should remain readable without excessive use of tags or formatting instructions (although some are inevitable). This is the main rationale for not using DocBook (XML), although DocBook assemblies have some potential for organising a knowledge base.
https://garrettgman.github.io/rmarkdown/authoring_pandoc_markdown.html provides reasonable detail of Pandoc's Markdown notation.
Major changes
core.5 to core.6
This is substantially revised from the core.5 version, which used multiple separate files. In particular, the use of Make is minimal and optional, and the main driver is (from core.6) a Python 3 program.
core.19 to core.20
The input and output file structures were reorganised so that they mirror each other. The YAML configuration was changed to move common configuration stanzas into their own list. The local variable output_stem was changed to file_stem.
Longer details
Configuration is available via a YAML file. Quite a lot can be configured: see settings within the Python source. Consider using yamllint to validate configuration files.
Customisation is also available via:
- macros (both simple and more complicated)
- CSS
- hacking the Python code: this is recommended when carrying out fixups, especially to the resulting HTML. One pattern is to set a local variable via the source document, then use a fixup to replace a placeholder in the Pandoc template.
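The fixup pattern described in the last item can be sketched roughly as follows. This is only an illustration: the function name apply_fixup and the placeholder text are hypothetical and are not taken from the Python source.

```python
import html


def apply_fixup(page: str, placeholder: str, value: str) -> str:
    """Swap a template placeholder for a variable's (escaped) value.

    Hypothetical helper: the real fixups live in the Python source and
    may work differently.
    """
    if placeholder not in page:
        raise ValueError(f"placeholder {placeholder!r} not found in output")
    # Escape the value so that it cannot break the surrounding HTML.
    return page.replace(placeholder, html.escape(value))
```

For example, apply_fixup(rendered, "@@REVIEWER@@", reviewer) would run after Pandoc has written the HTML, where reviewer holds a local variable set in the source document.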
All the outputs should magically appear in the output directory.
The document sources and auxiliary files are stored in a Git-controlled repository. This means that tags and branches can be used to manage development and proposed versions, and to record an audit trail of changes and releases. Because the build instructions are included with each version, it should remain possible to replicate the outputs of any commit (provided that the dependencies in section 14 continue to provide the same output).
Processor passes
The Python program drives a series of incremental changes to the files. These intermediate steps are normally hidden, but can be shown with the --steps command line option.
In rough order, the passes are:
- Read the configuration, input files and various macros
- Apply the macros / definitions (and remove them when no longer needed)
- Modify the Pandoc AST to add section and paragraph numbers
- Build a set of cross-references, then apply them to resolve various values and links
- Insert back-references
- Remove some "stuffing" data
- Use Pandoc to write HTML
- (Optionally) use Chromium to convert HTML to PDF
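The idea of chained passes with optionally visible intermediate steps can be sketched as below. This is a minimal illustration under stated assumptions: run_passes and the (name, function) pair layout are hypothetical, and the real driver works on files rather than on a single string.

```python
from typing import Callable, List, Tuple

# A pass is a (name, function) pair; the function maps text to text.
Pass = Tuple[str, Callable[[str], str]]


def run_passes(text: str, passes: List[Pass], steps: bool = False):
    """Apply each pass in order, optionally keeping every intermediate result.

    The steps flag loosely mirrors the --steps option: when set, the
    output of each pass is recorded rather than discarded.
    """
    history = []
    for name, func in passes:
        text = func(text)
        if steps:
            history.append((name, text))
    return text, history
```

Keeping each pass as a plain function makes it easy to inspect or reorder a single stage without touching the others.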
Configuration structure
The YAML file comprises a series of keys. The most notable are:
- outputs, which describes common groups of outputs
- configs, which describes common groups of definitions
- short, which gives the back-reference codes for each output file
The configs block gives:
- config: a name to refer to the config
- secnum and paranum: boolean flags as to whether section and paragraph numbers should be shown
- toc: optional, defaulting to false. If set true, a table of contents is included
- defns: a list of the macro files to be used for these outputs
The outputs block gives:
- dir: the output directory; these can be shared amongst multiple output blocks
- config: a reference to a config in the configs block. If omitted, then each file needs an entry in aconfs
- aconfs: alternative configs. A list mapping each output name to a particular config. Useful for overriding the configuration for a file in a particular directory
- html: list of HTML outputs wanted
- pdf: list of PDF outputs wanted; this must be a subset of the html list
- catalogue_title: set the name of the catalogue
- catalogue_order: a list of names to sort orders, where negative is at the end
- no_catalogue: if set True, inhibits generation of the catalogue file
- map: the relative path to other output blocks
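To make the relationship between the two blocks concrete, here is a sketch of how a config might be chosen for one output. The structure is written as a Python dictionary standing in for the parsed YAML. The exact nesting, the file and config names, and the rule that an aconfs entry takes precedence over config are all assumptions made for illustration; check settings in the Python source for the real layout.

```python
# Assumed shape of the parsed YAML; names and nesting are illustrative only.
SETTINGS = {
    "configs": [
        {"config": "numbered", "secnum": True, "paranum": True,
         "toc": True, "defns": ["defs/common.txt"]},
        {"config": "plain", "secnum": False, "paranum": False,
         "defns": ["defs/common.txt"]},
    ],
    "outputs": [
        {"dir": "public", "config": "numbered",
         "aconfs": {"index": "plain"},
         "html": ["index", "guide"], "pdf": ["guide"]},
    ],
}


def config_for(settings, block, name):
    """Pick the config for one output (assumption: aconfs overrides config)."""
    chosen = block.get("aconfs", {}).get(name, block.get("config"))
    if chosen is None:
        raise KeyError(f"{name}: no config and no aconfs entry")
    for candidate in settings["configs"]:
        if candidate["config"] == chosen:
            # toc is optional and defaults to false.
            return {"toc": False, **candidate}
    raise KeyError(f"{name}: unknown config {chosen!r}")


def check_block(block):
    """Every PDF output must also be listed as an HTML output."""
    extra = set(block.get("pdf", [])) - set(block.get("html", []))
    if extra:
        raise ValueError(f"pdf outputs not in html list: {sorted(extra)}")
```

With these sample settings, the index output would use the plain config while guide falls back to the block-wide numbered config.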
Macros/definitions
Simple
These definitions are contained in a file starting with the line simple. Thereafter, the first word of each line is the directive to be searched for, and the rest of the line (after a single space) is the replacement text.
These are most useful for short imperatives and applying consistent formatting with Pandoc.
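The file format above is simple enough to sketch a parser for. The following is an illustration rather than the project's own code: the function names are hypothetical, and it assumes a directive is invoked as the delimiter followed directly by its name (for example $nb).

```python
import re


def parse_simple(text):
    """Parse a simple definitions file into a {directive: replacement} table."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "simple":
        raise ValueError("a simple definitions file must start with 'simple'")
    table = {}
    for line in lines[1:]:
        if not line.strip():
            continue  # assumption: blank lines are ignored
        # First word is the directive; everything after one space is the text.
        name, _, replacement = line.partition(" ")
        table[name] = replacement
    return table


def apply_simple(source, table, delimiter="$"):
    """Replace each known directive; unknown ones are left untouched."""
    pattern = re.compile(re.escape(delimiter) + r"(\w+)")
    return pattern.sub(lambda m: table.get(m.group(1), m.group(0)), source)
```

A definitions file containing the lines simple, nb **Note:** and company Green Pike Ltd would then turn "$nb contact $company." into "**Note:** contact Green Pike Ltd.".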
Complex
The first line is complex, followed by a series of definitions.
Each definition takes the name of the directive/macro, the number of arguments (0 to 10) and then its definition (in [{…}]).
These are particularly useful for:
- applying variation, e.g., a macro that only provides output for some variants and not others
- inserting common blocks of text
Built-in
There is a single built-in macro, $include, which includes another file. The path is relative to the master working directory, not the including file's location.
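The path rule can be illustrated with a short sketch. The call syntax $include[{path}] is an assumption based on the [{…}] quoting described for macro arguments, and expand_includes is a hypothetical name; the real implementation may differ.

```python
import re
from pathlib import Path

# Assumed call syntax: $include[{relative/path}] (illustration only).
INCLUDE_RE = re.compile(r"\$include\[\{(.+?)\}\]")


def expand_includes(text: str, master_dir: Path, depth: int = 0) -> str:
    """Replace each include directive with the named file's contents."""
    if depth > 20:
        raise RecursionError("include nesting too deep (possible cycle)")

    def replace(match):
        # Always resolve against the master working directory,
        # never against the directory of the including file.
        included = (master_dir / match.group(1)).read_text(encoding="utf-8")
        return expand_includes(included, master_dir, depth + 1)

    return INCLUDE_RE.sub(replace, text)
```

Note that a file in sub/ which includes a sibling must still name it as sub/other.txt, because every path is resolved from the master working directory.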
Choice of delimiters
The default delimiter is $. The ! character works reasonably well, but is less visually clear.
The @ character would be better, but it repeatedly conflicts with existing features, particularly citation support.
The macro expansions and their arguments are quoted with [{…}] to make collisions less likely.
Choice of macro method
A simplistic implementation of simple and slightly more complex macros was written directly in Python, for the following reasons.
- m4 was used originally. However, it's less ideal due to the need to chain operations together and a hope of porting this to Windows (the native environment is Linux). More dependencies are problematic.
- m4's experimental changeword feature (a build-time configuration option) could be useful.
- Pandoc's Lua filters and Lua's gsub (or similar) function were another option, but this is more challenging because of the need to scan through all Str tokens and possibly split them at punctuation into pandoc.Strong() and pandoc.Str().
Choice of Pandoc numbering
Initially, Lua filters were used. However, these are slightly harder to manipulate in general for this purpose than directly mangling the Pandoc AST via its JSON export.
Alternatives considered and rejected:
- CSS numbering (e.g., via :before). Rejected because we'd still have to count through the items to build the value-labels within the cross-referencing.
- m4. Rejected because it would mean parsing paragraphs to match Pandoc's own parsing.
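As a rough illustration of numbering via the JSON export, the sketch below prefixes each top-level paragraph with a span holding its running number. It relies only on the documented shape of Pandoc's JSON AST (blocks with "t" and "c" keys, and spans carrying an [identifier, classes, key-value pairs] attribute). The para-N identifiers and the paranum class are hypothetical, and section numbering and nested blocks are left out.

```python
def number_paragraphs(doc: dict) -> dict:
    """Prefix each top-level Para block with a Span holding its number."""
    count = 0
    for block in doc["blocks"]:
        if block["t"] == "Para":
            count += 1
            label = str(count)
            span = {
                "t": "Span",
                # Attr is [identifier, classes, key-value pairs];
                # the identifier and class names here are made up.
                "c": [["para-" + label, ["paranum"], []],
                      [{"t": "Str", "c": label}]],
            }
            # Put the number span and a space ahead of the existing inlines.
            block["c"] = [span, {"t": "Space"}] + block["c"]
    return doc
```

The dictionary would typically come from json.load on the output of pandoc -t json, and the result can be fed back with pandoc -f json.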
Cross-references (xrefs)
A set of directives (see section above) can be inserted into the input files.
The requirement this addresses is to be able to build consistent cross-references across a range of documents. The final destinations of these documents may be across multiple directories (or even servers) despite being built from a single input directory. Additionally, being able to trace backwards is valuable to see which documents (and sections or paragraphs within them) are being referred to.
If the capabilities of the v directive (to show a value rather than build a link) were not needed, then Pandoc spans and inline references might have been viable. However, these identifiers are set too late for a post-processor to handle (or at least would need a restructure), and cross-document referencing would still require some external assistance.
Any identifiers set for sections in Pandoc using {…} will be preserved.
Duplicate cross-references result in errors, as do missing targets/values.
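The two error conditions, together with the back-reference bookkeeping, can be sketched as follows. All names here (XrefError, build_targets, resolve) and the tuple layouts are hypothetical, and the handling of reference kinds is simplified: v yields the value and every other kind yields the URL.

```python
class XrefError(Exception):
    """Raised for duplicate targets or references to missing targets."""


def build_targets(targets):
    """Build a lookup table from (label, value, url) triples."""
    table = {}
    for label, value, url in targets:
        if label in table:
            raise XrefError(f"duplicate cross-reference target: {label}")
        table[label] = {"value": value, "url": url, "backrefs": []}
    return table


def resolve(table, references):
    """Resolve (kind, label, source) references against the table."""
    resolved = []
    for kind, label, source in references:
        if label not in table:
            raise XrefError(f"missing cross-reference target: {label}")
        target = table[label]
        # Lower-case kinds record a back-reference; upper-case kinds do not.
        if kind.islower():
            target["backrefs"].append(source)
        resolved.append(target["value"] if kind.lower() == "v" else target["url"])
    return resolved
```

Building the whole table before resolving anything is what lets a reference point forwards, or into another document, without a second parse.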
Resolving cross-references
- p: removed entirely; the paragraph number span (or an empty span if visible numbering is disabled) is used as a target. Back-reference markers are also inserted for each inbound v/a/h/b reference, along with a link symbol.
- s: removed entirely; the section header is used as a target. Back-reference markers are also inserted for each inbound v/a/h/b reference, along with a section sign as a link.
- t: replaced with an empty anchor as target (except for a link symbol). Back-reference markers are also inserted for each inbound v/a/b reference.
- v: replaced with the value of the paragraph or section number, or variable (error if this points to a t target).
- a: effectively replaced with the URL in angle brackets <…>. Because Pandoc can't recognise some of the URLs we use as a URL in <…>, this pass instead uses a pair […](…){class=a} to obtain the desired result.
- h: similar to a, except the contents of […] is the value. It is suffixed with {class=h}, giving […](…){class=h}, for styling.
- b: replaced with the URL in parentheses (…).
- V, A, H, B: same as for v/a/h/b, but these do not result in back-reference markers.
Note that a single space, LF or CR-LF after x, p, s or t is also removed. This makes the original Markdown source easier to read by allowing some whitespace. If whitespace is desired after one of these directives, just use (at least) two items of whitespace.
Render HTML
The (near) final pass generates HTML using Pandoc. CSS is applied from the default base.css file and the customisable custom.css file. A further CSS file positions and styles the paragraph numbers. The --self-contained option is set so that any resources such as images are embedded in the HTML file rather than referenced.
The default structure also includes a trailing block of HTML which uses the script include/git-get-status to generate a footer with Git commit information.
The body element is also marked with data-dir and data-file attributes to enable the use of per-directory and/or per-file styling.
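For orientation, the rendering and optional PDF steps might be driven by command lines like those built below. This is a sketch under assumptions: the function names, the paranum.css file name and the chromium binary name are hypothetical, and the real program may pass additional options such as a template. The flags themselves (--to, --css, --output, --self-contained for Pandoc; --headless and --print-to-pdf for Chromium) are standard, although newer Pandoc releases document --embed-resources with --standalone as the replacement for --self-contained.

```python
from pathlib import Path
from typing import List, Sequence


def pandoc_html_command(source: Path, dest: Path,
                        stylesheets: Sequence[Path]) -> List[str]:
    """Build a Pandoc command line that writes self-contained HTML."""
    command = ["pandoc", str(source), "--to", "html", "--self-contained"]
    for sheet in stylesheets:
        # Each stylesheet is passed with its own --css option, in order.
        command += ["--css", str(sheet)]
    command += ["--output", str(dest)]
    return command


def chromium_pdf_command(page: Path, pdf: Path,
                         binary: str = "chromium") -> List[str]:
    """Build a headless Chromium command line that prints HTML to PDF."""
    # The binary name varies between systems (assumption: "chromium").
    return [binary, "--headless", f"--print-to-pdf={pdf}", str(page)]
```

Either list can be handed to subprocess.run(command, check=True), so a failing tool stops the build rather than silently leaving a stale output.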
Dependencies
Some dependencies are absolute:
- Pandoc
- Python 3
Recommended
- GNU Make
- Git (version control system)
- Chromium (for PDF output)
Copyright
Copyright © 2022-2023 Phil Brooke & Green Pike Ltd
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version 3 as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.