If a term, for example "Bourgogne", appears more than once in a single document, such as "Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France", your algorithm appends that document to the IrIndex.index list and the IrIndex.tf list once per occurrence. These duplicate entries inflate len(self.tf[term]), which the following code uses as the number of documents containing the term, so the resulting idf is too low:
idf = log( float( len(self.documents) ) / float( len(self.tf[term]) ) )
I changed the code from:
for term in terms:
    if term not in self.index:
        self.index[term] = []
        self.tf[term] = []
    self.index[term].append(document_pos)
    self.tf[term].append(terms.count(term))
to:
for term in terms:
    if term not in self.index:
        self.index[term] = []
        self.tf[term] = []
    if document_pos not in self.index[term]:
        self.index[term].append(document_pos)
        self.tf[term].append(terms.count(term))
by skipping any further append operations once the term has already been recorded for the current document in the IrIndex object.
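To make the effect concrete, here is a minimal, self-contained sketch of the indexing step with and without the membership check. MiniIndex, its dedupe flag, the simple tokenizer, and the second document ("Chablis 2009, France") are all hypothetical stand-ins for illustration, not part of your IrIndex class.

```python
from math import log


class MiniIndex:
    """Hypothetical, stripped-down stand-in for IrIndex (indexing step only)."""

    def __init__(self, dedupe):
        self.dedupe = dedupe      # True = apply the membership check
        self.documents = []
        self.index = {}           # term -> positions of documents containing it
        self.tf = {}              # term -> term count in each of those documents

    def add(self, document):
        document_pos = len(self.documents)
        self.documents.append(document)
        # Assumed tokenizer: lowercase, treat commas as spaces, split on whitespace.
        terms = document.lower().replace(",", " ").split()
        for term in terms:
            if term not in self.index:
                self.index[term] = []
                self.tf[term] = []
            if self.dedupe and document_pos in self.index[term]:
                continue          # term already recorded for this document
            self.index[term].append(document_pos)
            self.tf[term].append(terms.count(term))

    def idf(self, term):
        # Same formula as in the post: len(self.tf[term]) is the document frequency.
        return log(float(len(self.documents)) / float(len(self.tf[term])))


docs = [
    "Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France",
    "Chablis 2009, France",       # made-up second document
]

naive, fixed = MiniIndex(dedupe=False), MiniIndex(dedupe=True)
for d in docs:
    naive.add(d)
    fixed.add(d)

print(naive.tf["bourgogne"], naive.idf("bourgogne"))  # [2, 2] and idf 0.0
print(fixed.tf["bourgogne"], fixed.idf("bourgogne"))  # [2] and idf log(2)
```

Without the check, "bourgogne" looks as if it occurs in two documents out of two, so its idf collapses to 0. With the check it is counted in one document only. Because documents are indexed in order, comparing against the last entry of self.index[term] instead of scanning the whole list would likely work as well and avoid a linear search, though I have not tried that against the full recipe.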