Apache Tika 1.14 Released: A Content Extraction Toolkit

Source: OSChina (开源中国社区)  Author: 局长
  

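Apache Tika 1.14 has been released. This version includes a number of improvements and bug fixes. Tika is a toolkit for content extraction. It integrates POI and PDFBox and provides a unified interface for text extraction. Tika also offers a convenient extension API for adding support for third-party file formats.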

Changes in this release:

  • Extract all headers from MSG/RFC822 (TIKA-2122).

  • Upgrade metadata-extractor to 2.9.1 (TIKA-2113).

  • Extract PDF DocInfo metadata into separate keys to prevent overwriting by XMP metadata (TIKA-2057).

  • Re-enable fileUrl for tika-server (TIKA-2081). If you choose to use this feature, beware of the security vulnerabilities! See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271

  • Add Tesseract's hOCR output format as an option, via Eric Pugh (TIKA-2093).

  • Extract macros from MSOffice files (TIKA-2069).

  • Maintain passed-in mime in TXTParser (TIKA-2047).

  • Upgrade to POI 3.15 (TIKA-2013).

  • Upgrade to PDFBox 2.0.3 (TIKA-2051).

  • Fix hyperlinks with formatting in DOC and DOCX (TIKA-1255 and TIKA-2078)

  • Tika is now integrated with Google's TensorFlow library and can use its Inception v3 image classification model to identify objects in images (TIKA-1993).

  • Parser configuration is now type-safe and parameters for parsers can have assigned types (TIKA-1508, TIKA-1986).

  • Prevent OOM/permanent hang on some corrupt CHM files (TIKA-2040).

  • Upgrade ICU4J charset detection components to fix multithreading bug (TIKA-2041).

  • Upgrade to Jackcess 2.1.4 (TIKA-2039).

  • Maintain more significant digits in cells of "General" format in XLS and XLSX (TIKA-2025).

  • Avoid mark/reset issues when extracting or detecting embedded resources in RFC822 emails (TIKA-2037).

  • Improve the accuracy of Tesseract for better extraction of numeric and alphanumeric text from images (TIKA-2021, TIKA-2031).

  • Improve extraction of embedded documents from PPT, PPTX and XLSX (TIKA-2026).

  • Add parser for applefile (AppleSingle) (TIKA-2022).

  • Add mime types, mime magic and/or globs for:

    • Endnote Import File (TIKA-2011)

    • DJVU files (TIKA-2009)

    • MS Owner File (TIKA-2008)

    • Windows Media Metafile (TIKA-2004)

    • iCal and vCalendar (TIKA-2006)

    • MBOX (TIKA-2042)

    • Stata DTA (TIKA-2064)

  • Add configurable maximum threshold for number of events extracted from the XMP Media Management Schema in JempboxExtractor (TIKA-1999).

  • Integrate TesseractOCR with full page image rendering for PDFs (TIKA-1994).

  • Add mime detection (via Nick C) and a parser for DBF files (TIKA-1513).

  • Add mime detection and parsers for MSOffice 2003 XML Word and Excel formats (TIKA-1958).

  • Extract hyperlinks from PPT, PPTX and XLSX (TIKA-1454).

  • Upgrade to Commons Compress 1.12 (supports progress on TIKA-1358)
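
Several items above add "mime magic and/or globs" for new file types. As a rough illustration of what those two kinds of rules mean, here is a minimal, self-contained Java sketch. It does not use Tika and is not Tika's implementation: the class name `MimeSketch`, the small rule tables, and the precedence (magic bytes first, then file-name glob) are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: a tiny magic-plus-glob type detector (hypothetical, not Tika code). */
final class MimeSketch {
    // Media type -> leading "magic" bytes (a small hand-picked sample).
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // Simplified "*.ext" globs, stored as lower-case suffixes -> media type.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/zip", new byte[] {'P', 'K', 3, 4});
        GLOBS.put(".djvu", "image/vnd.djvu");
        GLOBS.put(".ics", "text/calendar");
        GLOBS.put(".mbox", "application/mbox");
    }

    /** Returns a media type for the given file name and leading bytes of the content. */
    static String detect(String fileName, byte[] head) {
        // 1. Assumption of this sketch: content (magic bytes) wins over the name.
        if (head != null) {
            for (Map.Entry<String, byte[]> rule : MAGIC.entrySet()) {
                if (startsWith(head, rule.getValue())) {
                    return rule.getKey();
                }
            }
        }
        // 2. Fall back to the file-name glob, compared case-insensitively.
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> rule : GLOBS.entrySet()) {
                if (lower.endsWith(rule.getKey())) {
                    return rule.getValue();
                }
            }
        }
        // 3. Nothing matched: report generic binary data.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("report.bin", pdfHead));
        System.out.println(detect("scan.DJVU", new byte[0]));
        System.out.println(detect("mystery", new byte[0]));
    }
}
```

A real detector such as Tika's works from a much larger rule set and weighs magic and name evidence in a more nuanced way; the sketch only shows why a release can improve detection simply by shipping new rules.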

Published: 2016-11-11 08:35
