Double-counting the documents containing an item

If an item, for example "Bourgogne", appears multiple times in a given document, such as "Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France", your algorithm appends that document to the item's IrIndex.index list and IrIndex.tf list once per occurrence. These repeated appends inflate len(self.tf[term]), which the following code uses as the number of documents containing the item, so the computed idf comes out too low:

idf = log( float( len(self.documents) ) / float( len(self.tf[term]) ) )
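To get a feel for the size of the distortion, here is a small sketch with made-up numbers (the counts below are hypothetical and not taken from the project's data). It assumes 100 documents, 4 of which contain the term, and 2 of those 4 mention the term twice, so the original loop would store 6 entries instead of 4.

```python
from math import log

# Hypothetical counts, chosen only for illustration.
num_documents = 100       # stands in for len(self.documents)
docs_containing_term = 4  # documents that really contain the term
stored_entries = 6        # len(self.tf[term]) when 2 of those documents repeat the term

# Same formula as in the issue, evaluated with the true and the inflated count.
idf_expected = log(float(num_documents) / float(docs_containing_term))
idf_distorted = log(float(num_documents) / float(stored_entries))

print("expected idf: ", idf_expected)
print("distorted idf:", idf_distorted)
```

The inflated denominator always pushes the idf down, so terms that repeat within a document are penalised twice: once through the formula and once through the duplicate entries.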

I changed the code from:

for term in terms:
    if term not in self.index:
        self.index[term] = []
        self.tf[term] = []

    self.index[term].append(document_pos)
    self.tf[term].append(terms.count(term))

to:

for term in terms:
    if term not in self.index:
        self.index[term] = []
        self.tf[term] = []

    if document_pos not in self.index[term]:
        self.index[term].append(document_pos)
        self.tf[term].append(terms.count(term))

The change skips the subsequent append operations when the item has already been recorded for its containing document in the IrIndex object, so each document is counted at most once per item.
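The effect of the guard can be checked with a minimal, self-contained sketch. The class below is only a stand-in for the project's IrIndex: the method name `index_document` and the lowercase, whitespace-based tokenisation are assumptions made for this example, while the loop body is the patched code from above.

```python
class IrIndex:
    """Minimal stand-in for the project's IrIndex; only indexing is sketched."""

    def __init__(self):
        self.documents = []
        self.index = {}  # term -> list of document positions
        self.tf = {}     # term -> list of counts, parallel to self.index[term]

    def index_document(self, document):
        # Hypothetical tokenisation: lowercase, drop commas, split on whitespace.
        terms = document.lower().replace(",", " ").split()
        document_pos = len(self.documents)
        self.documents.append(document)

        for term in terms:
            if term not in self.index:
                self.index[term] = []
                self.tf[term] = []

            # Record each (term, document) pair only once.
            if document_pos not in self.index[term]:
                self.index[term].append(document_pos)
                self.tf[term].append(terms.count(term))


ir = IrIndex()
ir.index_document("Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France")
print(ir.index["bourgogne"], ir.tf["bourgogne"])  # [0] [2]
```

Since document positions are appended in increasing order, comparing against the last element of `self.index[term]` would likely work as well and avoid scanning the whole list, but the membership test keeps the patch closest to the original code.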

