There was some subtlety in adding postings to a word which already exists in the lexicon. The crux of the problem is this: each word parsed from an article body either exists in the lexicon or it doesn't. If it does exist, we must check whether it already has a posting for the current document ID; if it does, we increase that posting's frequency, and otherwise we create a new posting for the document. If the word doesn't exist at all, we create a new lexical entry holding its first posting. The code which solves this problem is in the InvertedIndex implementation of the Index interface, in the function addLexicalTerms:
    public void addLexicalTerms(Document d) {
        Pattern p = Pattern.compile("\\w+");
        String[] temp = d.getBodyText().split("\\ "); // split on spaces
        //d.addBody("");
        for (int j = 0; j < temp.length; j++) {
            Matcher m = p.matcher(temp[j]);
            // Keep only tokens made of word characters that are longer than one character
            if (!temp[j].equals("") && !(temp[j].length() == 1) && m.matches()) {
                if (!StopWordList.getInstance().containsStopWord(temp[j].toLowerCase())) {
                    if (lexicon.containsKey(temp[j])) {
                        // The word exists: look for a posting for the current document
                        boolean found = false;
                        for (Posting posting : lexicon.get(temp[j])) {
                            if (posting.getDocID().equals(d.getDocumentID())) {
                                posting.increaseFrequency();
                                found = true;
                                break;
                            }
                        }
                        // Add a posting only when none of the existing postings
                        // belongs to this document
                        if (!found) {
                            Posting newPosting = new Posting(d.getDocumentID());
                            newPosting.increaseFrequency();
                            lexicon.get(temp[j]).add(newPosting);
                            postingCount++;
                        }
                    } else {
                        /*
                         * We should create a new lexical entry
                         */
                        Posting newPosting = new Posting(d.getDocumentID());
                        newPosting.increaseFrequency();
                        LinkedList<Posting> tempList = new LinkedList<Posting>();
                        tempList.add(newPosting);
                        lexicon.put(temp[j], tempList);
                        postingCount++;
                    }
                }
            }
        }
    }
Because not many people like reading other people's code, here's a better explanation.
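The three cases (new word, known word in a new document, known word in a document already seen) can also be shown in a much smaller, self-contained sketch. This is not the InvertedIndex class from the post: the class name SketchIndex and its methods are hypothetical, postings are modelled as a map from document ID to frequency instead of a list of Posting objects, and the stop word check is left out to keep it short.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the post's InvertedIndex; illustration only. */
class SketchIndex {
    // term -> (document ID -> frequency of the term in that document)
    private final Map<String, Map<String, Integer>> lexicon = new HashMap<>();
    private int postingCount = 0;

    /** Adds every qualifying token of bodyText to the index for docId. */
    void addTerms(String docId, String bodyText) {
        for (String token : bodyText.split(" ")) {
            // Same filter as the post: word characters only, longer than one character.
            // (Stop word filtering is omitted in this sketch.)
            if (token.length() < 2 || !token.matches("\\w+")) {
                continue;
            }
            // Case 1: unknown word, so create an empty lexical entry for it.
            Map<String, Integer> postings =
                    lexicon.computeIfAbsent(token, t -> new LinkedHashMap<>());
            // Case 2: known word but no posting for this document yet.
            if (!postings.containsKey(docId)) {
                postingCount++;
            }
            // Cases 2 and 3: start the frequency at 1 or increase the existing one.
            postings.merge(docId, 1, Integer::sum);
        }
    }

    /** Returns how often term occurs in docId, or 0 if there is no posting. */
    int frequency(String term, String docId) {
        Map<String, Integer> postings = lexicon.get(term);
        return postings == null ? 0 : postings.getOrDefault(docId, 0);
    }

    int postingCount() {
        return postingCount;
    }
}
```

For example, after calling addTerms("doc1", "apple banana apple") and then addTerms("doc2", "apple a"), the sketch holds three postings: apple in doc1 with frequency 2, banana in doc1 with frequency 1, and apple in doc2 with frequency 1. The single-character token "a" is skipped by the length filter. Keying the postings by document ID removes the need to scan a list for the current document, at the cost of not matching the Posting class used in the original code.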