Information Retrieval & Text Mining Toolbox

This repository provides core functions for Information Retrieval and Text Mining (IRTM) processing. It is under continuous development.

Quick Install using 'pip/pip3' & GitHub

pip install git+https://github.com/KanishkNavale/IRTM-Toolbox.git

Import Module

from irtm.toolbox import *
import numpy as np  # used by the examples below

Using Functions

Soundex: A phonetic algorithm for indexing names by sound, as pronounced in English.
```
print(soundex('Muller'))
print(soundex('Mueller'))
```
```
>>> 'M466'
>>> 'M466'
```
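For readers unfamiliar with the algorithm, here is a minimal sketch of classic American Soundex. The function name `soundex_sketch` is hypothetical. This is not the toolbox's implementation, and the codes it produces may differ from the toolbox output above.

```python
# Letter groups and their digits in classic American Soundex.
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex_sketch(name):
    """Return a 4-character Soundex code (illustrative sketch only)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # 'h' and 'w' do not separate letters with equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (encoded + "000")[:4]  # pad with zeros, keep four characters


print(soundex_sketch("Robert"), soundex_sketch("Rupert"))
```

Names that sound alike, such as "Robert" and "Rupert", map to the same code, which is what makes the code usable as an index key.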

Tokenizer: Converts a sequence of characters into a sequence of tokens.

print(tokenize('LINUX'))
print(tokenize('Text Mining 2021'))

>>> ['linux']
>>> ['text', 'mining']
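The behaviour shown above (lowercasing and dropping the number) can be approximated with a short regular expression. This is a sketch under that assumption only. `tokenize_sketch` is a hypothetical name, and the toolbox's `tokenize` may apply further steps that are not visible in these two examples.

```python
import re


def tokenize_sketch(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


print(tokenize_sketch("LINUX"))
print(tokenize_sketch("Text Mining 2021"))
```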

Vectorize: Converts a list of strings into a token-based weight tensor.

vector = vectorize([
        'texts ([string]): a multiline or a single line string.',
        'dict ([list], optional): list of tokens. Defaults to None.',
        'enable_Idf (bool, optional): use IDF or not. Defaults to True.',
        'normalize (str, optional): normalization of vector. Defaults to l2.',
        'max_dim ([int], optional): dimension of vector. Defaults to None.',
        'smooth (bool, optional): restricts value >0. Defaults to True.',
        'weightedTf (bool, optional): Tf = 1+log(Tf). Defaults to True.',
        'return_features (bool, optional): feature vector. Defaults to False.'
        ])

print(f'Vector Shape={vector.shape}')

>>> Vector Shape=(8, 37)
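The option descriptions in the example mention IDF weighting, smoothing, Tf = 1+log(Tf), and l2 normalization. The sketch below combines these in plain Python to show how such a weight matrix can be built. The function name `tfidf_sketch` and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions made for illustration, so the toolbox's exact values may differ.

```python
import math
import re
from collections import Counter


def tfidf_sketch(texts):
    """Return (matrix, vocabulary) with one l2-normalized row per text."""
    docs = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: in how many texts each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))

    matrix = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for tok, count in Counter(doc).items():
            tf = 1.0 + math.log(count)                          # weighted Tf
            idf = math.log((1 + n_docs) / (1 + df[tok])) + 1.0  # assumed smoothing
            row[index[tok]] = tf * idf
        norm = math.sqrt(sum(v * v for v in row)) or 1.0        # avoid dividing by 0
        matrix.append([v / norm for v in row])
    return matrix, vocab


matrix, vocab = tfidf_sketch(["text mining", "text retrieval"])
print(vocab)
print(len(matrix), len(matrix[0]))
```

Each row corresponds to one input string and each column to one token, which is why the example above reports a shape of (number of strings, number of tokens).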

Predict Token Weights: Computes the importance of each token through classification optimization.

dictionary = ['vector', 'string', 'bool']
vector = vectorize([
        'X ([np.array]): vectorized matrix columns arranged as per the dictionary.',
        'y ([labels]): True classification labels.',
        'epochs ([int]): Optimization epochs.',
        'verbose (bool, optional): Enable verbose outputs. Defaults to False.',
        'dict ([type], optional): list of tokens. Defaults to None.'
        ], dict=dictionary)

# Random binary labels (0 or 1), for illustration only.
labels = np.random.randint(2, size=(vector.shape[0], 1))
weights = predict_weights(vector, labels, 100, dict=dictionary)

>>> Token-Weights Mappings: {'vector': 0.22097790924850977, 
                             'string': 0.39296369957440075, 
                             'bool': 0.689853175081446}
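One common way to obtain per-token weights from a classification objective is logistic regression trained by gradient descent, where each learned coefficient is read as the importance of its token. The sketch below illustrates that idea only. It is an assumption, not a description of how `predict_weights` works internally, and `fit_token_weights` is a hypothetical name.

```python
import math


def fit_token_weights(X, y, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - label  # predicted prob. minus label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


# Toy data: column 0 counts the token 'good', column 1 the token 'bad'.
tokens = ["good", "bad"]
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
weights, _ = fit_token_weights(X, y)
print(dict(zip(tokens, weights)))
```

In this toy setup the token that appears only in positive samples receives a positive weight and the other a negative one.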

Page Rank: Computes the page rank from a chain matrix.

chain_matrix = np.array([[0, 0, 1],
                         [1, 0, 1],
                         [0, 1, 0]])

print(page_rank(chain_matrix))

rank, TPM = page_rank(chain_matrix, return_TransMatrix=True)
print(f'Page Rank: {rank} \nTransition Probability Matrix: \n{TPM}')

>>> [0.0047 0.997  0.0767]
>>> Page Rank: [0.0047 0.997  0.0767]
    Transition Probability Matrix:
    [[0.03333333 0.03333333 0.93333333]
     [0.48333333 0.03333333 0.48333333]
     [0.03333333 0.93333333 0.03333333]]
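The transition probability matrix printed above is consistent with row-normalizing the chain matrix and mixing in a teleport probability of 0.1 (for example, 0.9 * 1 + 0.1 / 3 = 0.9333). The plain-Python sketch below rebuilds the matrix under that assumption and finds a rank vector by power iteration. The names `transition_matrix` and `page_rank_sketch` are hypothetical. The sketch normalizes the ranks to sum to 1, so its values may differ from the toolbox output.

```python
def transition_matrix(chain, teleport=0.1):
    """Row-normalize the chain matrix and mix in uniform teleportation."""
    n = len(chain)
    matrix = []
    for row in chain:
        total = sum(row)
        if total == 0:
            matrix.append([1.0 / n] * n)  # page without out-links: jump anywhere
        else:
            matrix.append([(1 - teleport) * v / total + teleport / n for v in row])
    return matrix


def page_rank_sketch(chain, teleport=0.1, tol=1e-12, max_iter=1000):
    """Power iteration for the stationary distribution of the random surfer."""
    P = transition_matrix(chain, teleport)
    n = len(P)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(rank[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new, P
        rank = new
    return rank, P


chain = [[0, 0, 1],
         [1, 0, 1],
         [0, 1, 0]]
rank, P = page_rank_sketch(chain)
print(rank)
for row in P:
    print(row)
```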
