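This article explores Apache Lucene, a high-performance, full-featured text search engine library. It covers the Lucene architecture and its core API, and shows how to use Lucene for cross-platform full-text search, indexing, displaying results, and extending search.

Lucene is an open source, highly scalable search engine library available from the Apache Software Foundation. You can use Lucene in both commercial and open source applications. Its API focuses mainly on text indexing and searching. It can be used to build search capabilities for applications such as e-mail clients, mailing lists, Web searches, and database searches. Web sites such as Wikipedia, TheServerSide, jGuru, and LinkedIn use Lucene.

Lucene also provides search capabilities for the Eclipse IDE, Nutch (the well-known open source Web search engine), and companies such as IBM®, AOL, and Hewlett-Packard. Lucene has been ported to many other programming languages, including Perl, Python, C++, and .NET. As of 30 July 2009, the latest version of Lucene for the Java™ programming language is V2.4.1.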

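Lucene has many features:

It provides powerful, accurate, and efficient search algorithms.
It calculates a score for each document that matches a given query and returns the most relevant documents ranked by score.
It supports many powerful query types, such as PhraseQuery, WildcardQuery, RangeQuery, FuzzyQuery, and BooleanQuery.
It supports parsing of rich query expressions typed by people.
It lets users extend search behavior with custom sorting, filtering, and query expression parsing.
It uses a file-based locking mechanism to protect against concurrent index modifications.
It allows searching and indexing simultaneously.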

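As Figure 1 shows, building a full-featured search application with Lucene mainly involves indexing data, searching data, and displaying search results.

This article uses code snippets from a sample application developed with Lucene V2.4.1 and Java technology. The sample application indexes a set of e-mail documents stored in properties files and shows how to use Lucene's query API to search the index. The example also familiarizes you with basic index operations.

Lucene lets you index any data available in textual format. It can be used with almost any data source, as long as textual information can be extracted from it. You can use Lucene to index and search data stored in HTML documents, Microsoft® Word documents, PDF files, and more. The first step in indexing data is to convert it into a simple text format. You can do this with custom parsers and data converters.

Indexing is the process of converting text data into a format that facilitates rapid searching. A simple analogy is the index at the back of a book, which points you to the pages where a topic appears.

Lucene stores the input data in a data structure called an inverted index, which is stored on the file system or in memory as a set of index files. Most Web search engines use an inverted index. It lets users perform fast keyword look-ups and find the documents that match a given query. Before text data is added to the index, it is processed by an analyzer (using an analysis process).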
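
To make the idea of an inverted index concrete, here is a minimal sketch in plain Java. It is not Lucene's implementation or on-disk format; the class name ToyInvertedIndex and its methods are hypothetical, invented for this illustration. It only shows the core mapping from each term to the documents that contain it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index (hypothetical, for illustration only):
// maps each lowercase term to the sorted set of ids of documents containing it.
class ToyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private final List<String> documents = new ArrayList<>();

    // Adds a document and returns its id (its position in insertion order).
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        // Split on anything that is not a lowercase letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of documents containing the term, or an empty set.
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Job openings for Java professionals");
        index.add("Java architect wanted in Pune");
        index.add("Weekly mailing list digest");
        System.out.println(index.lookup("java"));   // prints [0, 1]
        System.out.println(index.lookup("digest")); // prints [2]
    }
}
```

A keyword look-up is a single map access rather than a scan over every document, which is why this structure suits fast keyword queries.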

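Analysis is the process of converting text data into a fundamental unit of searching, called a term. During analysis, the text data goes through several operations: extracting the words, removing common words, ignoring punctuation, reducing words to their root form, changing words to lowercase, and so on. Analysis happens just before indexing and query parsing. It converts text data into tokens, and these tokens are added as terms to the Lucene index.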
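
The analysis steps described here can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not one of Lucene's analyzers: the class ToyAnalyzer and its stop-word list are hypothetical, and the sketch skips stemming (reducing words to root form).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer (hypothetical): tokenizes, lowercases, and drops stop words.
class ToyAnalyzer {
    // A tiny stop-word list chosen for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "at", "for", "in", "of", "the"));

    // Converts raw text into the list of terms that would be indexed.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Lowercase first, then split on punctuation and whitespace.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [job, openings, java, professionals, bangalore]
        System.out.println(analyze("Job openings for Java Professionals at Bangalore"));
    }
}
```

Because the indexed terms are lowercase, a query for the exact string "Java" would not match any of them, which is why queries normally go through the same analysis as the indexed text.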

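Lucene comes with various built-in analyzers, such as SimpleAnalyzer, StandardAnalyzer, StopAnalyzer, and SnowballAnalyzer. They differ in how they tokenize the text and apply filters. Because analysis removes words before indexing, it reduces the index size, but it can hurt the precision of query processing. You can gain more control over the analysis process by creating custom analyzers from the basic building blocks Lucene provides. Table 1 shows some of the built-in analyzers and how they process data.

Analyzer: Description

WhitespaceAnalyzer: Splits tokens at whitespace.

SimpleAnalyzer: Divides text at nonletter characters and puts the text in lowercase.

StopAnalyzer: Removes stop words (words that are not useful for searching) and puts the text in lowercase.

StandardAnalyzer: Tokenizes text based on a sophisticated grammar (recognizing e-mail addresses, acronyms, Chinese, Japanese, and Korean characters, alphanumerics, and more), puts the text in lowercase, and removes stop words.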

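Directory: An abstract class that represents the location where index files are stored. Two subclasses are commonly used:
FSDirectory: An implementation of Directory that stores indexes in the actual file system. It is useful for large indexes.
RAMDirectory: An implementation that stores all indexes in memory. It is suitable for smaller indexes that can be fully loaded in memory and destroyed when the application terminates. Because the index is held in memory, it is comparatively fast.

Analyzer: As discussed above, analyzers are responsible for processing text data and converting it into tokens stored in the index. IndexWriter accepts the analyzer used to tokenize data before it is indexed. To index text properly, you should use an analyzer that is appropriate for the language of that text.

The default analyzers work well for English. The Lucene sandbox contains several other analyzers, including ones for Chinese, Japanese, and Korean.

IndexDeletionPolicy: An interface used to implement a custom policy for deleting stale commits from the index directory. The default deletion policy is KeepOnlyLastCommitDeletionPolicy, which keeps only the most recent commit and removes all prior commits as soon as a new commit is done.

IndexWriter: The class that creates or maintains an index. Its constructor accepts a Boolean that determines whether a new index is created or an existing index is opened. It provides methods to add, delete, and update documents in the index.

Changes made to the index are initially buffered in memory and periodically flushed to the index directory. IndexWriter exposes several fields that control how indexes are buffered in memory and written to disk. Changes made to the index are not visible to an IndexReader until the commit or close method of IndexWriter is called. IndexWriter creates a lock file for the directory to prevent index corruption by simultaneous index updates. IndexWriter lets users specify an optional index deletion policy.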

Java code

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexDeletionPolicy;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Create the Directory instance where the index files will be stored
Directory fsDirectory = FSDirectory.getDirectory(indexDirectory);
// Create the analyzer that will be used to tokenize the input data
Analyzer standardAnalyzer = new StandardAnalyzer();
// true creates a new index; false opens an existing one
boolean create = true;
// Create the deletion policy instance
IndexDeletionPolicy deletionPolicy = new KeepOnlyLastCommitDeletionPolicy();
// indexWriter is declared elsewhere in the sample application
indexWriter = new IndexWriter(fsDirectory, standardAnalyzer, create,
    deletionPolicy, IndexWriter.MaxFieldLength.UNLIMITED);

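Adding text data to the index involves two classes.

Field: Represents a piece of data queried or retrieved in a search. The Field class encapsulates a field name and its value. Lucene provides options to specify whether a field needs to be indexed or analyzed and whether its value needs to be stored. These options can be passed when creating a field instance. The following table shows the details of the Field metadata options.

Option: Description

Field.Store.YES: Stores the field value. Suitable for fields displayed with search results, such as file path and URL.

Field.Store.NO: Does not store the field value, for example, the e-mail message body.

Field.Index.NO: Suitable for fields that are not searched and are only stored, such as file path.

Field.Index.ANALYZED: The field is indexed and analyzed, for example, the e-mail message body and subject.

Field.Index.NOT_ANALYZED: The field is indexed but not analyzed. The field's original value is preserved in its entirety, for example, dates and personal names.

Document: A collection of fields. Lucene also supports boosting documents and fields, which is useful when you want to give more importance to some of the indexed data. Indexing a text file involves wrapping the text data in fields, creating a document, populating it with the fields, and adding the document to the index with IndexWriter.

Listing 2 shows an example of adding data to the index.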

Java code

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/* Step 1. Prepare the data for indexing. Extract the data. */
String sender = properties.getProperty("sender");
String date = properties.getProperty("date");
String subject = properties.getProperty("subject");
String message = properties.getProperty("message");
String emaildoc = file.getAbsolutePath();

/* Step 2. Wrap the data in Fields and add them to a Document */
Field senderField =
    new Field("sender", sender, Field.Store.YES, Field.Index.NOT_ANALYZED);
Field emaildatefield =
    new Field("date", date, Field.Store.NO, Field.Index.NOT_ANALYZED);
Field subjectField =
    new Field("subject", subject, Field.Store.YES, Field.Index.ANALYZED);
Field messagefield =
    new Field("message", message, Field.Store.NO, Field.Index.ANALYZED);
Field emailDocField =
    new Field("emailDoc", emaildoc, Field.Store.YES, Field.Index.NO);

Document doc = new Document();
// Add these fields to a Lucene Document
doc.add(senderField);
doc.add(emaildatefield);
doc.add(subjectField);
doc.add(messagefield);
doc.add(emailDocField);

/* Step 3. Add this document to the Lucene index. */
indexWriter.addDocument(doc);

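Searching is the process of looking for words in the index and finding the documents that contain those words. Building search capabilities with Lucene's search API is straightforward. This section discusses the primary classes of the Lucene search API.

Searcher: An abstract base class with various overloaded search methods. IndexSearcher is a commonly used subclass that searches indexes stored in a given directory. The search method returns a collection of documents ordered by computed score. Lucene computes a score for each document that matches a given query. IndexSearcher is thread-safe; a single instance can be used by multiple threads concurrently.

Term: The fundamental unit of searching. It is composed of two elements: the text of the word and the name of the field in which the text occurs. Term objects are also involved in indexing, but there they are created by Lucene internals.

Query: An abstract base class for queries. Searching for a specified word or phrase involves wrapping it in a term, adding the terms to a query object, and passing the query object to IndexSearcher's search method.

Lucene comes with various concrete query implementations, such as TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, RangeQuery, MultiTermQuery, FilteredQuery, and SpanQuery. The following paragraphs discuss the main query classes of Lucene's query API.

TermQuery: The most basic query type for searching an index. A TermQuery is built from a single term. Term values are matched exactly, including case, so the search term you pass must agree with the term produced by analysis of the document, because analyzers perform many operations on the original text before the index is built.

For example, consider the e-mail subject "Job openings for Java Professionals at Bangalore." Assume you indexed it with StandardAnalyzer. If you now search for "Java" using TermQuery, nothing is returned, because StandardAnalyzer normalized the text and converted it to lowercase at index time. If you search for the lowercase word "java," it returns all the mail whose subject field contains that word.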

Java code

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Search mails having the word "java" in the subject field
Searcher indexSearcher = new IndexSearcher(indexDirectory);
Term term = new Term("subject", "java");
Query termQuery = new TermQuery(term);
// Fetch at most the 10 highest-scoring matches
TopDocs topDocs = indexSearcher.search(termQuery, 10);

RangeQuery: You can use RangeQuery to search within a range. All the terms in an index are arranged lexicographically (in dictionary order). Lucene's RangeQuery lets users search for terms within a range. The range is specified with a starting term and an ending term, which can be either both included or both excluded.

Java code

import org.apache.lucene.search.RangeQuery;

/* RangeQuery example: search mails dated 01/06/2009 through 06/06/2009,
   both inclusive */
Term begin = new Term("date", "20090601");
Term end = new Term("date", "20090606");
// The third argument makes both end points inclusive
Query query = new RangeQuery(begin, end, true);
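
Because terms are compared lexicographically, the dates in the range example are written as yyyyMMdd strings, so that dictionary order and chronological order agree. The following plain-Java sketch illustrates that point. ToyTermRange is a hypothetical helper written for this illustration and says nothing about how Lucene evaluates a RangeQuery internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy range selection over lexicographically ordered terms (hypothetical helper).
class ToyTermRange {
    // Returns the terms t with lower <= t <= upper in dictionary order.
    static List<String> inclusiveRange(SortedSet<String> terms, String lower, String upper) {
        List<String> result = new ArrayList<>();
        for (String term : terms) {
            if (term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0) {
                result.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SortedSet<String> dates = new TreeSet<>(Arrays.asList(
                "20090531", "20090601", "20090603", "20090606", "20090607"));
        // prints [20090601, 20090603, 20090606]
        System.out.println(inclusiveRange(dates, "20090601", "20090606"));
        // Unpadded, day-first strings do not sort chronologically:
        // "10/06/2009" sorts before "6/06/2009" because '1' < '6'. prints true
        System.out.println("10/06/2009".compareTo("6/06/2009") < 0);
    }
}
```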

PrefixQuery: You can search by a prefixed word with PrefixQuery, which constructs a query that finds documents containing terms that start with a specified word prefix.

Java code

import org.apache.lucene.search.PrefixQuery;

// Search mails whose sender field starts with the word 'job'
PrefixQuery prefixQuery = new PrefixQuery(new Term("sender", "job"));

BooleanQuery: You can construct powerful queries by combining any number of query objects with BooleanQuery. Each query is added together with a clause that indicates whether the query should occur, must occur, or must not occur. In a BooleanQuery, the maximum number of clauses is limited to 1,024 by default. You can change the maximum by calling the setMaxClauseCount method.

Java code

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;

// Search mails that have both 'java' and 'bangalore' in the subject field
Query query1 = new TermQuery(new Term("subject", "java"));
Query query2 = new TermQuery(new Term("subject", "bangalore"));
BooleanQuery query = new BooleanQuery();
// MUST means a document has to match this clause to be returned
query.add(query1, BooleanClause.Occur.MUST);
query.add(query2, BooleanClause.Occur.MUST);

PhraseQuery: You can search by phrase with PhraseQuery, which matches documents that contain a particular sequence of terms. PhraseQuery uses the positional information of terms stored in the index. The distance allowed between the terms that are considered a match is called slop. By default, the value of slop is zero; it can be changed by calling the setSlop method. PhraseQuery also supports multiple-term phrases.

Java code

import org.apache.lucene.search.PhraseQuery;

/* PhraseQuery example: search mails that have the phrase 'job opening j2ee'
   in the subject field. */
PhraseQuery query = new PhraseQuery();
// Allow the terms to be at most one position apart from an exact match
query.setSlop(1);
query.add(new Term("subject", "job"));
query.add(new Term("subject", "opening"));
query.add(new Term("subject", "j2ee"));

WildcardQuery: WildcardQuery implements a wild-card search query, which lets you search with patterns such as arch* (matching architect, architecture, and so on). Two standard wild cards are used:
* matches zero or more characters
? matches exactly one character

Searching with a pattern that begins with a wild card can degrade performance, because all the terms in the index must be examined to find matching documents.

Java code

import org.apache.lucene.search.WildcardQuery;

/* Search for 'arch*' to find e-mail messages that have the word 'architect'
   in the subject field. */
Query query = new WildcardQuery(new Term("subject", "arch*"));
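
The wild-card semantics can be illustrated in plain Java by translating a pattern into a regular expression. This sketch assumes that * stands for zero or more characters and ? for exactly one character. ToyWildcard is a hypothetical class for illustration only and is not how WildcardQuery is implemented.

```java
import java.util.regex.Pattern;

// Toy wild-card matcher (hypothetical): '*' = zero or more chars, '?' = exactly one char.
class ToyWildcard {
    // Translates a wild-card pattern into an equivalent regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote every literal character so regex metacharacters stay literal.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns true if the whole term matches the wild-card pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("arch*", "architect")); // prints true
        System.out.println(matches("arch*", "march"));     // prints false
        System.out.println(matches("arch?", "arch"));      // prints false
    }
}
```

A pattern such as *tect has no literal prefix to narrow the candidates, so every term has to be tested. This is the reason leading wild cards are slow.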

FuzzyQuery: You can search for similar terms with FuzzyQuery, which matches words that are similar to the specified word. The similarity measurement is based on the Levenshtein (edit distance) algorithm. In Listing 9, FuzzyQuery is used to find the closest match for the misspelled word "admnistrtor," even though this word is not indexed.

Java code

import org.apache.lucene.search.FuzzyQuery;

/* Search for e-mails that have a word similar to 'admnistrtor' in the
   subject field. Note that admnistrtor is deliberately misspelled here. */
Query query = new FuzzyQuery(new Term("subject", "admnistrtor"));
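
For readers unfamiliar with the Levenshtein algorithm, the sketch below computes the edit distance between two words in plain Java. It is a textbook dynamic-programming version written for this illustration. The class name ToyEditDistance is hypothetical, and the way FuzzyQuery turns a distance into a similarity score is not shown.

```java
// Textbook Levenshtein distance (hypothetical helper, for illustration only).
class ToyEditDistance {
    // Returns the minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Two missing letters ('i' and 'a'), so the distance is 2.
        System.out.println(levenshtein("admnistrtor", "administrator")); // prints 2
    }
}
```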

QueryParser: QueryParser is useful for parsing query strings typed by people. You can use it to parse user-entered query expressions into Lucene query objects, which can then be passed to IndexSearcher's search method. It can parse rich query expressions. Internally, QueryParser converts the query string into one of the concrete query subclasses. You need to escape special characters such as * and ? with a backslash (\). You can construct Boolean queries textually with the operators AND, OR, and NOT.

Java code

import org.apache.lucene.queryParser.QueryParser;

// "subject" is the default field for terms that do not name a field
QueryParser queryParser = new QueryParser("subject", new StandardAnalyzer());
// Search for e-mails that contain the words 'job openings' and '.net' and 'pune'
// Note: parse throws ParseException if the expression is malformed
Query query = queryParser.parse("job openings AND .net AND pune");

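IndexSearcher returns an array of references to ranked search results, such as documents that match a given query. You can decide the number of top search results to retrieve by specifying it in IndexSearcher's search method. Custom paging can be built on top of this. You can add a custom Web application or desktop application to display the search results. The primary classes involved in retrieving search results are ScoreDoc and TopDocs.

ScoreDoc: A simple pointer to a document contained in the search results. It encapsulates the position of the document in the index and the score computed by Lucene.

TopDocs: Encapsulates the total number of search results and an array of ScoreDoc.

The following code snippet shows how to retrieve the documents contained in the search results.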

Java code

import org.apache.lucene.search.ScoreDoc;

/* The first parameter is the query to be executed and the
   second parameter is the number of search results to fetch */
TopDocs topDocs = indexSearcher.search(query, 20);
System.out.println("Total hits " + topDocs.totalHits);

// Get an array of references to matched documents
ScoreDoc[] scoreDocsArray = topDocs.scoreDocs;
for (ScoreDoc scoredoc : scoreDocsArray) {
    // Retrieve the matched document and show relevant details
    Document doc = indexSearcher.doc(scoredoc.doc);
    System.out.println("\nSender: " + doc.getField("sender").stringValue());
    System.out.println("Subject: " + doc.getField("subject").stringValue());
    System.out.println("Email file location: "
        + doc.getField("emailDoc").stringValue());
}
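
Custom paging on top of a ranked result list can be as simple as slicing it. The sketch below is one possible approach in plain Java, not something the Lucene API provides. ToyPager and its page method are hypothetical names, and the list stands in for whatever the application extracts from the search results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy pager (hypothetical): slices an already ranked hit list into pages.
class ToyPager {
    // Returns the hits for a 1-based page number, or an empty list past the end.
    static <T> List<T> page(List<T> rankedHits, int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be >= 1");
        }
        long from = (long) (pageNumber - 1) * pageSize; // long avoids int overflow
        if (from >= rankedHits.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(rankedHits.size(), from + pageSize);
        return new ArrayList<>(rankedHits.subList((int) from, to));
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("mail-1", "mail-2", "mail-3", "mail-4", "mail-5");
        System.out.println(page(hits, 2, 2)); // prints [mail-3, mail-4]
        System.out.println(page(hits, 3, 2)); // prints [mail-5]
    }
}
```

With this approach, showing page n of size k means asking the search method for at least n * k top results and displaying only the last slice.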

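Basic index operations include removing and boosting documents.

Applications often need to update the index with the latest data and remove older data. For example, in a Web search engine the index must be updated regularly, because new Web pages are always being added and nonexistent pages must be removed. Lucene provides the IndexReader class, which lets you perform these operations on an index.

IndexReader is an abstract class that provides various methods to access an index. Lucene refers to documents internally by document numbers, which can change as documents are added to or removed from the index. The document number is used to access a document in the index. An IndexReader cannot be used to update an index in a directory for which an IndexWriter is already open. An IndexReader always searches the snapshot of the index taken when it was opened. Changes to the index are not visible until the IndexReader is reopened, so applications using Lucene must reopen their IndexReaders to see the latest index updates.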

Java code

import org.apache.lucene.index.IndexReader;

// Delete all the mails from the index received in May 2009.
IndexReader indexReader = IndexReader.open(indexDirectory);
indexReader.deleteDocuments(new Term("month", "05"));
// Close associated index files and save deletions to disk
indexReader.close();

Sometimes you want to give more importance to some of the indexed data. You can do so by setting a boost factor for a document or a field. By default, all documents and fields have the same boost factor of 1.0.

Java code

// Rank mails that contain 'pune' in their subject first by boosting the subject field
if (subject.toLowerCase().indexOf("pune") != -1) {
    subjectField.setBoost(2.2F);
}
// Rank mails that contain 'job' in the sender's e-mail address first
// by boosting the whole document (luceneDocument is the Document being indexed)
if (sender.toLowerCase().indexOf("job") != -1) {
    luceneDocument.setBoost(2.1F);
}

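Lucene provides an advanced feature called sorting. You can sort search results by fields that indicate the relative position of the documents in the index. A field used for sorting must be indexed but not tokenized. Four kinds of term values can be used in sort fields: integers, longs, floats, and strings.

Search results can also be sorted by index order. By default, Lucene sorts results by decreasing relevance, that is, by the computed score. The sort order can be changed.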

Java code

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopFieldDocs;

/* Search mails having the word 'job' in the subject and return results
   sorted by the sender's e-mail address in descending order. */
// The second argument (true) reverses the natural order, giving descending order
SortField sortField = new SortField("sender", true);
Sort sortBySender = new Sort(sortField);
WildcardQuery query = new WildcardQuery(new Term("subject", "job*"));
TopFieldDocs topFieldDocs =
    indexSearcher.search(query, null, 20, sortBySender);
// Sorting by index order
topFieldDocs = indexSearcher.search(query, null, 20, Sort.INDEXORDER);

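Filtering is the process of constraining the search space so that only a subset of documents is considered for a search. You can use this feature to implement search-within-search results or to implement security on top of search results. Lucene comes with various built-in filters, such as BooleanFilter, CachingWrapperFilter, ChainedFilter, DuplicateFilter, PrefixFilter, QueryWrapperFilter, RangeFilter, RemoteCachingWrapperFilter, and SpanFilter. A Filter can be passed to IndexSearcher's search method to restrict the results to documents that match the filter criteria.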

Java code

import org.apache.lucene.search.Filter;
import org.apache.lucene.search.PrefixFilter;

/* Filter the results to show only mails whose sender field
   is prefixed with 'jobs' */
Term prefix = new Term("sender", "jobs");
Filter prefixFilter = new PrefixFilter(prefix);
WildcardQuery query = new WildcardQuery(new Term("subject", "job*"));
// Only documents accepted by prefixFilter are considered for the query
indexSearcher.search(query, prefixFilter, 20);

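Lucene is a very popular open source search library from Apache that provides powerful indexing and searching capabilities for applications. It offers a simple, easy-to-use API that requires only a minimal understanding of how indexing and searching work. In this article, you learned about the Lucene architecture and its core API.

Lucene powers search for many well-known Web sites and organizations. It has also been ported to many other programming languages. Lucene has a large and active technical user community. If you are looking for an easy-to-use, scalable, and high-performance open source search library, Apache Lucene is a great choice.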
