优米格
Sharing content worth reading

Indexing Word/PDF Files into Elasticsearch with Java


Environment

  • Elasticsearch 7.1.1;
  • ingest-attachment plugin 7.1.1;
  • CentOS 7.

This post records one way to index an uploaded attachment into Elasticsearch from Java.

Prerequisites

Document processing relies on the Elasticsearch ingest-attachment plugin. For the preparation steps, see the following article:

实现

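// Imports this method needs. JSONObject is assumed to be fastjson's; the original does not say.
import com.alibaba.fastjson.JSONObject;
import org.apache.commons.io.IOUtils;
import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.common.xcontent.XContentType;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// `client` is presumably a RestHighLevelClient field initialized elsewhere.
public String indexAttachmentToElasticSearch(String fileFullPath) {
    String result = "error";
    // try-with-resources closes the stream even when reading fails
    try (InputStream inputStream = new FileInputStream(new File(fileFullPath))) {
        byte[] fileByteStream = IOUtils.toByteArray(inputStream);
        // encodeToString already returns a String; no getBytes()/new String() round trip is needed
        String base64String = Base64.getEncoder().encodeToString(fileByteStream);
        Map<String, Object> attachmentMap = new HashMap<>();
        attachmentMap.put("data", base64String); // Base64 payload for the attachment pipeline
        attachmentMap.put("fileName", "四个空格-https://www.yomige.org");
        String jsonString = JSONObject.toJSONString(attachmentMap);
        IndexRequest request = new IndexRequest("data_archives_attachment");
        request.id(UUID.randomUUID().toString());
        request.setPipeline("single_attachment");
        request.source(jsonString, XContentType.JSON);
        IndexResponse response = client.index(request, RequestOptions.DEFAULT);
        result = response.status().toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (ElasticsearchException e) {
        // the original called getDetailedMessage() and discarded the result
        System.err.println(e.getDetailedMessage());
    } catch (IOException e) {
        e.printStackTrace();
    }
    return result;
}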

In the code above, single_attachment is the name of the ingest pipeline and data_archives_attachment is the name of the index.
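The definition of the single_attachment pipeline is not shown in this post. As a rough sketch only, assuming the pipeline consists of a single attachment processor that reads the data field filled in by the code above, it could be registered with a PUT to _ingest/pipeline/single_attachment. The class and method names and the http://localhost:9200 address below are hypothetical; the description text is also made up for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AttachmentPipelineSketch {

    // Request body for PUT _ingest/pipeline/<name>: one attachment processor
    // that reads Base64 content from sourceField (assumed to be "data" here).
    public static String pipelineBody(String sourceField) {
        return "{"
                + "\"description\":\"Extract attachment information\","
                + "\"processors\":[{\"attachment\":{\"field\":\"" + sourceField + "\"}}]"
                + "}";
    }

    // Sends the definition to Elasticsearch and returns the HTTP status code.
    public static int putPipeline(String baseUrl, String pipelineName, String body) throws IOException {
        URL url = new URL(baseUrl + "/_ingest/pipeline/" + pipelineName);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }

    public static void main(String[] args) throws IOException {
        String body = pipelineBody("data");
        System.out.println(body);
        // Hypothetical local node; uncomment once Elasticsearch is running.
        // System.out.println(putPipeline("http://localhost:9200", "single_attachment", body));
    }
}
```

With a pipeline like this, the extracted text ends up under the attachment field of the indexed document, which matches the structure described in the comment at the end of this post.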

The implementation relies on the Elasticsearch ingest-attachment plugin to process a Base64 string. The file is converted to a Base64 string as follows:

inputStream = new FileInputStream(new File(fileFullPath));
byte[] fileByteStream = IOUtils.toByteArray(inputStream);
// encodeToString already returns an ASCII-only String, so it needs no re-decoding
String base64String = Base64.getEncoder().encodeToString(fileByteStream);
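For comparison, the same conversion can be sketched with the JDK alone, without commons-io. This is a minimal illustration, not the code from the linked repository; the class name FileToBase64Sketch and the method encodeFile are hypothetical. Like the excerpt above, it loads the whole file into memory, so it is only suitable for files of moderate size.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class FileToBase64Sketch {

    // Reads the whole file into memory and returns its Base64 representation.
    public static String encodeFile(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) throws IOException {
        // Use a small temporary file so the example runs anywhere.
        Path tmp = Files.createTempFile("attachment-demo", ".txt");
        try {
            Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
            String encoded = encodeFile(tmp);
            System.out.println(encoded); // aGVsbG8=
            byte[] decoded = Base64.getDecoder().decode(encoded);
            System.out.println(new String(decoded, StandardCharsets.UTF_8)); // hello
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The resulting string is what goes into the data field of the document before it is sent through the pipeline.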

Complete code example: https://github.com/aitlp/elastic


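Comments (1)

  1. #1

    With this approach, the stored document has the structure {"attachment":{"content":"xxx"}}.
    How can it be turned into {"content":"xxx"} so that the structure stays consistent?

    poi, 4 years ago (2020-10-13)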
