Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I have implemented parse_roman() function #64

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

AmPhIbIaN26
Copy link

I have implemented the parse_roman() function.

I have implemented roman_numeral() function with the use of regex to build the number.
@codecov
Copy link

codecov bot commented Apr 19, 2021

Codecov Report

Merging #64 (086f5ec) into master (c834854) will decrease coverage by 0.65%.
The diff coverage is 95.16%.

@@            Coverage Diff             @@
##           master      #64      +/-   ##
==========================================
- Coverage   98.78%   98.12%   -0.66%     
==========================================
  Files          86       86              
  Lines         328      374      +46     
  Branches       60       78      +18     
==========================================
+ Hits          324      367      +43     
  Misses          1        1              
- Partials        3        6       +3     
Impacted Files Coverage Δ
number_parser/parser.py 97.57% <95.08%> (-0.78%) ⬇️
number_parser/__init__.py 100.00% <100.00%> (ø)

@Gallaecio
Copy link
Member

Now I think we need to make parse and parse_number use this function as well.

I’m thinking that we may also want to include a new parameter to those functions, numeral_systems, which allows to limit the numeral systems to support while parsing. We could make it so that by default all numeral systems are used, but you can use numeral_systems=['decimal'] to limit parsing to decimal numbers, or numeral_systems=['roman'] to limit it to roman numbers.

For cases where users want to exclude a system, rather than include it, it may make sense to expose a public variable that contains the list of all supported numeral systems, e.g. number_parser.NUMERAL_SYSTEMS, so that users can create an exclusion-based subset (e.g. [system for system in NUMERAL_SYSTEMS if system != 'roman']).

Once these changes are done, I think we no longer need parse_roman itself, and it could be renamed as _parse_roman to discourage people from using it.

I’m also thinking that it may be possible to have slightly different implementations of parse_roman for each of the user-facing number-parser functions, for performance-tuning. But I haven’t look into it in detail, and what’s most important is that things work as intended, we can worry about performance later.

@AmPhIbIaN26 AmPhIbIaN26 changed the title I have implemented parse_roman() funciton I have implemented parse_roman() function Apr 19, 2021
@AmPhIbIaN26
Copy link
Author

AmPhIbIaN26 commented Apr 25, 2021

I still didn't understand how the user would use this [system for system in NUMERAL_SYSTEMS if system != 'roman']

Let’s say we introduce a numeral_systems parameter to parse_number so that users may use:

parse_number('V', numeral_systems=['roman'])

If we also create a NUMERAL_SYSTEMS with all supported numeral systems, users that want to exclude a numeral system can do:

all_numeral_systems_but_roman = [system for system in NUMERAL_SYSTEMS if system != 'roman']
parse_number('V', numeral_systems=all_numeral_systems_but_roman)

If in the future we add support for additional number systems, that code would still work as intended. Whereas if users had to hardcode the list of systems, new systems added later would also be excluded.

@AmPhIbIaN26 AmPhIbIaN26 reopened this Apr 25, 2021
Added the  _valid_input_by_numeral_system(), it will decide the numeral system based on the input string if not given by the user.
@AmPhIbIaN26
Copy link
Author

AmPhIbIaN26 commented Apr 28, 2021

Hi @Gallaecio hope you and your family are safe.

I added the _valid_input_by_numeral_system() (in this fork) still have to improve this so it is easier to add another numeral system. I will work on that now and also improving the performance of _parse_roman().

added NUMERAL_SYSTEM list where you decide which numeral system should be used to parse.
Added more test cases to test the numeral system
@AmPhIbIaN26
Copy link
Author

I have added numeral_systems as a parameter to parse(), the current system works like this

all_numeral_systems_but_roman = [system for system in NUMERAL_SYSTEMS if system != 'roman']
all_numeral_systems_but_decimal = [system for system in NUMERAL_SYSTEMS if system != 'decimal']

>>>parse('Built in MMLXXVII.')
'Built in 2077.'

>>>parse( 'Built in MMLXXVII.', ['decimal'])
'Built in MMLXXVII.'

>>>parse('I was given two IV injections.', all_numeral_systems_but_roman)
'I was given 2 IV injections.'

>>>parse('I was given two IV injections.', all_numeral_systems_but_decimal)
1 was given two 4 injections.'

>>>parse('I was given two IV injections.')
'1 was given 2 4 injections.'

>>>parse('I have three apples.', all_numeral_systems_but_roman)
'I have 3 apples.'

I have added all these examples as test cases. I wanted to ask that since parse_number() only takes in 1 input how will having numeral_systems as a parameter help?

number_parser/parser.py Outdated Show resolved Hide resolved
number_parser/parser.py Outdated Show resolved Hide resolved
number_parser/parser.py Outdated Show resolved Hide resolved
number_parser/parser.py Outdated Show resolved Hide resolved
number_parser/parser.py Outdated Show resolved Hide resolved
tests/data/permutations/all_roman_numbers.csv Show resolved Hide resolved
tests/data/permutations/all_roman_numbers.csv Show resolved Hide resolved
@Gallaecio
Copy link
Member

Gallaecio commented May 4, 2021

I have added all these examples as test cases. I wanted to ask that since parse_number() only takes in 1 input how will having numeral_systems as a parameter help?

It could allow users to fine-tune for performance if they know beforehand the numeral system of the input number, by making number-parser only use the desired number parser. Also, if they want parsing to simply fail for something like I, instead of returning 1.

@AmPhIbIaN26
Copy link
Author

Thanks for looking into these, ill make the changes

@AmPhIbIaN26
Copy link
Author

@Gallaecio hope you and your family are safe.
I have made all the changes you suggested also added support for incorrect roman numbers.

number_parser/parser.py Outdated Show resolved Hide resolved
number_parser/parser.py Outdated Show resolved Hide resolved
number_parser/parser.py Outdated Show resolved Hide resolved
@AmPhIbIaN26
Copy link
Author

Hi @Gallaecio hope you and your family are safe.

Any thoughts on this?

SENTENCE_SEPARATORS = [".", ","]
SUPPORTED_LANGUAGES = ['en', 'es', 'hi', 'ru']
RE_BUG_LANGUAGES = ['hi']
NUMERAL_SYSTEMS = ('decimal', 'roman')
ROMAN_REGEX_EXPRESSION = "(?i)^(m{0,3})(cm|cd|d?c{0,4})(xc|xl|l?x{0,4})(ix|iv|v?i{0,4})$"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s make this a private constant, so that we can freely rename it or move it in the future if we wish without breaking the API:

Suggested change
ROMAN_REGEX_EXPRESSION = "(?i)^(m{0,3})(cm|cd|d?c{0,4})(xc|xl|l?x{0,4})(ix|iv|v?i{0,4})$"
_ROMAN_REGEX_EXPRESSION = "(?i)^(m{0,3})(cm|cd|d?c{0,4})(xc|xl|l?x{0,4})(ix|iv|v?i{0,4})$"

@@ -241,7 +244,11 @@ def parse_ordinal(input_string, language=None):
return parse_number(output_string, language)


def parse_number(input_string, language=None):
def _is_roman(search_string):
return re.search(ROMAN_REGEX_EXPRESSION, search_string, re.IGNORECASE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(?i) is already in the pattern. Also, this function should return a boolean value.

Suggested change
return re.search(ROMAN_REGEX_EXPRESSION, search_string, re.IGNORECASE)
return bool(re.search(ROMAN_REGEX_EXPRESSION, search_string))

if _is_roman(input_string):
numeral_systems = ['roman']

elif language in SUPPORTED_LANGUAGES and not _is_roman(input_string):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not _is_roman(input_string) will always be true, given the preceding if statement.

Suggested change
elif language in SUPPORTED_LANGUAGES and not _is_roman(input_string):
elif language in SUPPORTED_LANGUAGES:

Also, I am probably forgetting something, but why are we forcing decimal for any supported language? I could understand doing it for en given I is an English word, but I don’t think we need to avoid Roman numerals in Spanish, for example.

elif language in SUPPORTED_LANGUAGES and not _is_roman(input_string):
numeral_systems = ['decimal']

for numeral_system in numeral_systems:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail if numeral_systems is None.

Comment on lines +280 to +284
return None
number_built = _build_number(normalized_tokens, lang_data)
if len(number_built) == 1:
return int(number_built[0])
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those return None seem like an issue now within a for loop. It looks like they will prevent the for loop to reach the next iteration (for the next numeral system).

Maybe you could move this code into a _parse_decimal function, and only return if the result is not None, else let the for loop go to the next numeral system.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants