search-convert-procs.tcl


Convert MS Office files to plain text: reuse the approach employed for .pptx to also extract text from .docx and .xlsx, as the formats are very similar.

Whitespace cleanup

Prefer util::which to retrieve the unzip executable

Implement a conversion from MS pptx to plaintext:

first, the slides are extracted from the presentation; then everything that is not the content of a text tag is removed.
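
The two steps described in this commit message can be illustrated with a minimal sketch. It is not the Tcl code from the changeset: the function name `pptx_to_text` is hypothetical, the archive is read with Python's `zipfile` module instead of an external unzip executable, and it assumes that slides live under `ppt/slides/slideN.xml` and that text runs are wrapped in `<a:t>` elements, as is usual for .pptx files.

```python
import html
import re
import zipfile

# Assumption: slides are stored as ppt/slides/slide<N>.xml inside the archive.
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml")
# Assumption: visible text sits inside <a:t>...</a:t> elements.
TEXT_RUN = re.compile(r"<a:t(?:\s[^>]*)?>(.*?)</a:t>", re.DOTALL)


def pptx_to_text(path):
    """Return the text of every slide, one line per slide (hypothetical helper)."""
    lines = []
    with zipfile.ZipFile(path) as archive:
        # Step 1: pick the slide files and order them by slide number.
        slides = []
        for name in archive.namelist():
            match = SLIDE_NAME.fullmatch(name)
            if match:
                slides.append((int(match.group(1)), name))
        # Step 2: keep only the content of the text tags, drop everything else.
        for _, name in sorted(slides):
            xml = archive.read(name).decode("utf-8")
            runs = [html.unescape(run) for run in TEXT_RUN.findall(xml)]
            lines.append(" ".join(runs))
    return "\n".join(lines)
```

A regular expression is enough here because only the text content is wanted; a full XML parser would be the more robust choice if structure such as paragraphs or tables had to be preserved.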

Reduce hard errors in the search indexer on invalid file content

This change uses util::file_content_check introduced with acs-tcl

5.10.1d9 to detect error situations before external programs are

called, which can lead to unpredictable error messages.

bumped version to 5.10.1d1

  1. … 2 more files in changeset.
Don't blindly trust the MIME type determined by the file extension; try to use the Unix command "file" when available
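
As an illustration of the idea in this commit message, the following sketch guesses the MIME type from the extension and lets the `file` command override that guess when the command is available. The function name `detect_mime` and its `file_cmd` parameter are hypothetical, and the fallback policy (return the extension-based guess whenever `file` is missing or fails) is an assumption, not necessarily what the changeset does.

```python
import mimetypes
import shutil
import subprocess


def detect_mime(path, file_cmd=None):
    """Return a MIME type for path, preferring content sniffing (hypothetical helper)."""
    # Extension-based guess; this is the value that should not be trusted blindly.
    guessed, _ = mimetypes.guess_type(path)

    # Look up the "file" executable unless the caller supplied one.
    file_cmd = file_cmd or shutil.which("file")
    if file_cmd is None:
        return guessed

    try:
        result = subprocess.run(
            [file_cmd, "--brief", "--mime-type", path],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        # "file" could not be run or reported an error: fall back to the guess.
        return guessed

    sniffed = result.stdout.strip()
    return sniffed or guessed
```

With this sketch, a plain-text file that was renamed to `report.pdf` is reported by its content when `file` is installed, and as `application/pdf` otherwise.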

Downgrade to a warning the case where PDFs fail to be converted to text because of password protection

merged changes from the oacs-5-9 branch and resolved conflicts

  1. … 7834 more files in changeset.
Improve robustness of "file delete" operations

  1. … 25 more files in changeset.
- get rid of annoying messages "Format a4 is redefined" caused by broken xls2csv

- Don't choke on empty content

- remove duplicate delete command

- add editor hints to keep spaces/tabs in the future more consistent

- prefer utf8 over iso8859

  1. … 66 more files in changeset.
Merging back to HEAD branch oacs-5-8 (using tag vg-merge-oacs-5-8-from-20141027).

  1. … 2547 more files in changeset.
- The C library function tmpnam() has been deprecated for a while, so NaviServer has deprecated ns_tmpnam as well.

We therefore introduce a new function "ad_tmpnam", which requires just a minimal change and uses ns_mktemp.

  1. … 27 more files in changeset.
- added argument "-passing_style" to search::content_filter

to make explicit whether data contains content or a file name.

Previously, text/* automatically meant passing-style string,

leading to missing results when plain files were added

to the content repository

- added filter for text/plain

- fixed filter for text/html

- further clean up

  1. … 1 more file in changeset.
ppthtml is now catppt, part of the catdoc package on most distros

Added converter for open docs. Removed path for command passed to exec. Removed unused proc choice_bar

  1. … 1 more file in changeset.
Add better mime type matching in conversion. Add utility for

powerpoint conversion.

committing search work from sloan

    • -0
    • +66
    ./search-convert-procs.tcl
  1. … 41 more files in changeset.