A PDF arrives, a cron job checks once a minute whether new files have arrived; when it finds one, it starts a little processing chain.
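For illustration, such a once-a-minute cron entry could look roughly like the following; the script name, path and log file are placeholders, not taken from my actual setup.
# hypothetical crontab entry: check the import directory every minute
# (script path and log path are placeholders)
* * * * * /opt/archive/process_incoming.sh >> /var/log/archive-import.log 2>&1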
### Process the file ###
# OCR the document
ocrmypdf -l deu "$entry" "$OUTPUT_DIR${entry##*/}"
# extract all text of the pdf to a text file
pdf2txt -o "$OUTPUT_DIR${entry##*/}.txt" "$OUTPUT_DIR${entry##*/}"
# save thumbnails of the pages of the pdf
convert "$entry" -quality 30 "$OUTPUT_DIR${entry##*/}.jpg"
I am using OCRmyPDF to OCR the incoming file; it writes a new output PDF with a text layer underneath the scanned image.
With pdf2txt the text is extracted, and all blanks and empty lines are removed so that each file ends up as one string of text.
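The collapsing into one string is not shown in the snippet above; a minimal sketch of how it could be done in the same script, filling the PDFTXT variable that appears in the JSON later, might look like this.
# collapse the extracted text into a single line and squeeze repeated blanks
# (note: quotes in the text would still need escaping before going into the JSON)
PDFTXT=$(tr '\n' ' ' < "$OUTPUT_DIR${entry##*/}.txt" | tr -s ' ')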
Using ImageMagick, thumbnails of each PDF page are generated at a low but still readable quality.
This result is stored in an archive directory structure, together with a JSON file containing the most important information. This little JSON file is created directly in the shell script. Low tech indeed.
# Assemble the thumbnail sub-JSON
SEARCH="$OUTPUT_DIR${entry##*/}"
for JPGFILE in $SEARCH*.jpg; do
THUMBNAILS="$THUMBNAILS{\"imgname\" : \"${JPGFILE##*/}\",\"imdirectory\" : \"/$OUTPUT_DIR\"},"
done
## strip the last ","
THUMBNAILS=${THUMBNAILS%,}
JSON="{
\"document\" : {
\"name\" : \"$NAME\",
\"directoy\" : \"$OUTPUT_DIR\",
\"text\" : \"$PDFTXT\",
\"timestamp\" : \"$YEAR-$MONTH-$DAY-$HOUR-$MINUTE\",
\"origin\" : \"SCAN\",
\"thumbnails\" : [
$THUMBNAILS
],
\"tags\" : [
{
\"tagname\" : \"SCANNED\"
}
]
}
}"
echo "$JSON" > "$OUTPUT_DIR${entry##*/}.json"
A .1 file is written in the import directory to mark a PDF as processed. I am planning to use the JSON file later as a document in Elasticsearch.
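A sketch of those two steps could look like the lines below, assuming the marker is simply an empty file named after the original PDF and that Elasticsearch runs locally with an index called documents (both of these are assumptions, not part of the current script).
# mark the pdf as processed (empty marker file next to the original; naming is an assumption)
touch "$entry.1"
# later: push the JSON document into a local Elasticsearch index
# (index name "documents" and URL are assumptions)
curl -s -X POST "http://localhost:9200/documents/_doc" \
     -H 'Content-Type: application/json' \
     --data-binary "@$OUTPUT_DIR${entry##*/}.json"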