Convert MS Office files to plain text: reuse the approach employed for .pptx to also extract text from .docx and .xlsx, as the process is very similar.
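
As an illustration of why one approach can serve several formats, here is a minimal Python sketch (not the package's Tcl code). It assumes that a .docx keeps its text in `<w:t>` elements of `word/document.xml` and that an .xlsx keeps its strings in `<t>` elements of `xl/sharedStrings.xml`; the names `SOURCES` and `office_to_text` are hypothetical.

```python
import html
import os
import re
import zipfile

# Assumed location of the text in each container: (archive member, text tag).
SOURCES = {
    ".docx": ("word/document.xml", "w:t"),
    ".xlsx": ("xl/sharedStrings.xml", "t"),
}


def office_to_text(path):
    """Return the text found in the text tags of a .docx or .xlsx file."""
    suffix = os.path.splitext(path)[1].lower()
    member, tag = SOURCES[suffix]
    # Match <tag> or <tag attr="...">, but not longer names such as <w:tab/>.
    pattern = re.compile(rf"<{tag}(?:\s[^>]*)?>(.*?)</{tag}>", re.DOTALL)
    with zipfile.ZipFile(path) as archive:
        xml = archive.read(member).decode("utf-8")
    # Keep only the tag contents and decode XML entities such as &amp;.
    return " ".join(html.unescape(text) for text in pattern.findall(xml))
```

Only the member name and the tag differ between the two formats in this sketch, which is the similarity the commit message refers to. Numeric cell values in a spreadsheet live elsewhere in the archive and are not covered here.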

Make the test more verbose on failure: show the extracted text in that case

Fix test and proc:

- make package_id mandatory for search::dotlrn::get_community_id, as the proc would otherwise just fail when used under dotlrn

- fix the test by providing a package_id according to the expected behavior

Test leftover api

Test queuing, dequeuing

Test extra arg api from the search package

file test.docx was initially added on branch oacs-5-10.

file test.ppt was initially added on branch oacs-5-10.

    • binary
    ./test/data/test.ppt
file test.pdf was initially added on branch oacs-5-10.

    • binary
    ./test/data/test.pdf
file test.ott was initially added on branch oacs-5-10.

file test.ots was initially added on branch oacs-5-10.

file test.otp was initially added on branch oacs-5-10.

file test.odt was initially added on branch oacs-5-10.

file test.ods was initially added on branch oacs-5-10.

file test.odp was initially added on branch oacs-5-10.

file test.html was initially added on branch oacs-5-10.

file test.xlsx was initially added on branch oacs-5-10.

file test.xls was initially added on branch oacs-5-10.

    • binary
    ./test/data/test.xls
file test.txt was initially added on branch oacs-5-10.

file test.pptx was initially added on branch oacs-5-10.

Test binary to text conversion of various file types

file search-procs.tcl was initially added on branch oacs-5-10.

    • -0
    • +0
    ./test/search-procs.tcl
file test.doc was initially added on branch oacs-5-10.

    • binary
    ./test/data/test.doc
Whitespace cleanup

Make use of new API "ad_mktmpdir" and "ad_opentmpfile" instead of "ad_tmpnam"

Prefer util::which to retrieve the unzip executable

Complain in the logfile whenever an attempt is made to insert a null character into the syndication table

Implement a conversion from MS pptx to plaintext:

First, slides are extracted from the presentation; then everything that is not the content of a text tag is removed.
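
The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's Tcl code: it assumes slides are stored as `ppt/slides/slideN.xml` inside the zip archive and that text sits in `<a:t>` elements; the function name `pptx_to_text` is hypothetical.

```python
import html
import re
import zipfile

# Assumed layout: one XML file per slide, text inside <a:t> elements.
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml$")
TEXT_TAG = re.compile(r"<a:t(?:\s[^>]*)?>(.*?)</a:t>", re.DOTALL)


def pptx_to_text(path):
    """Return the text of a .pptx file, one line per slide."""
    lines = []
    with zipfile.ZipFile(path) as archive:
        # Step 1: pick the slide files and order them by slide number.
        slides = []
        for name in archive.namelist():
            match = SLIDE_NAME.match(name)
            if match:
                slides.append((int(match.group(1)), name))
        # Step 2: drop everything that is not the content of a text tag.
        for _, name in sorted(slides):
            xml = archive.read(name).decode("utf-8")
            texts = [html.unescape(t) for t in TEXT_TAG.findall(xml)]
            lines.append(" ".join(texts))
    return "\n".join(lines)
```

Sorting by the numeric part of the file name keeps slide10 after slide2, which a plain string sort would not.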

More targeted sanitizing only on the variables that have a chance to contain the null character

Replace potential null characters in the syndication content with the empty string, so that we do not risk trying (and failing) to insert them into the database
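
A minimal Python sketch of this kind of sanitizing, offered as an illustration rather than the package's Tcl code. The function name `strip_nul` and the field names in the usage example are hypothetical.

```python
NUL = "\x00"


def strip_nul(fields, keys):
    """Return a copy of fields with NUL characters removed from the given keys."""
    cleaned = dict(fields)
    for key in keys:
        value = cleaned.get(key)
        # Only touch string values that actually contain a NUL character.
        if isinstance(value, str) and NUL in value:
            cleaned[key] = value.replace(NUL, "")
    return cleaned
```

For example, `strip_nul({"title": "a\x00b", "body": "text"}, ["title"])` returns `{"title": "ab", "body": "text"}`. Restricting the replacement to named keys leaves all other values untouched.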