Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Contribute Morphuntion from Apple as open source software #35

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

grhoten
Copy link
Member

@grhoten grhoten commented Oct 30, 2024

This contribution is related to the Unicode Technical Workshop topic about Inflection.

This contribution should resolve the following issues: #5, #6, #7, #11, #12, #13, #15, #17, #18, #19
This contribution is also related to the following issues without fully resolving the issues: 3, 4, 8, 10, 21, 23, 24, 25
This contribution also has an implementation that addresses these CLDR issues: 13025, 13563

@CLAassistant
Copy link

CLAassistant commented Oct 30, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@nciric nciric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First round. I'll go into actual code in the next.

@@ -0,0 +1,118 @@
#
# Copyright 2018-2024 Apple Inc. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we keep the Apple copyright message? Mostly a question for Anne and George.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the email discussion, the previously included license as been removed in order to use the existing license in this repository, but these copyright notices should not be disturbed.

add_link_options(-g -fprofile-instr-generate -fcoverage-mapping)
endif()

#Set these warning properties on a project level
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Some comments start with capital letters, some don't. Some have space between # and text and some don't.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed this file to make the comments more consistent in style.

morphuntion/README.md Show resolved Hide resolved
morphuntion/README.md Show resolved Hide resolved

plugins {
id 'net.ltgt.errorprone' version '1.1.1' apply false
id 'maven-publish'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dumb question - what do we use Java for in this project?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's dictionary-parser. Denny will likely want to look at that code. It can be reimplemented in another programming language when using wikidata as a data source, but I wanted to provide the tool source code so that the logic behind generating the lexical dictionaries is a lot clearer. The tool will likely be deleted once wikidata can be ingested.

This also had a Java wrapper through JNI. That implementation was excluded from this contribution at this time to expedite making this implementation open source.

#
# Copyright 2022-2024 Apple Inc. All rights reserved.
#
yue_HK=yue_Hant
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just mapping one form of LangID to another? Why (comment)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an aliasing scheme. In this case, we're making the script explicit for a given region.

tokenizer.normalize=\u062A
tokenizer.replace=\u0629

#tokenizer.decompound=(\u0648|\u0641)?(\u0628|\u0643|\u0644)?(\u0627\u0644)?(.+?)(\u064A|\u0646\u064A|\u0643|\u0646\u0627|\u0643\u0645|\u0643\u0646|\u0647|\u0647\u0627|\u0643\u0646|\u0647\u0645|\u0647\u0646)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been removed. It was informative about the previous implementation, but it's not helpful anymore.

/*
* Copyright 2016-2024 Apple Inc. All rights reserved.
*/
// This is a generated file. Do not edit this file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care how it's generated? If yes, some epointers to the generator would be good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is generated with the generateICUWrapper tool. It takes an existing ICU installation, and generates C++ wrappers around the C API. This provides a natural adaptor for C++ API, but with the stability of the C ABI.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file can be regenerated when ICU adds new C API.

BreakIterator(UBreakIteratorType type, const char * locale, std::u16string_view text) {
UErrorCode ec = U_ZERO_ERROR;
wrappee_ = ubrk_open(type, locale, (const UChar *)text.data(), (int32_t)text.size(), &ec);
::morphuntion::exception::ICUException::throwOnFailure(ec);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some companies don't allow exceptions in code. Do these bubble up to top or are they always caught and then API returns an error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're not caught on the C++ API. The C API wrappers around the C++ code converts the exceptions to CFErrorRef. If you don't want C++ exceptions, use the C API. If you want C++ exceptions, use the C++ API. C++ library code also generates exceptions. So you can't completely omit them from C++.

/*
* Copyright 2016-2024 Apple Inc. All rights reserved.
*/
#pragma once
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully this works everywhere we care about (we use guards instead). https://en.wikipedia.org/wiki/Pragma_once

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That page doesn't show any compilers that don't support it. I guess the only issue would be if you want to use an old compiler. From a maintenance point of view, this is much more reliable than using macros.

Copy link
Contributor

@nciric nciric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub makes reviewing 800+ files PR a nightmare (everything is super laggy and slow). Let's land this and then go over important details in follow up PRs.
I guess current legal discussion is preventing the merge.

@@ -0,0 +1,551 @@
//
// main.cpp
// icu4cxx_generate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if, long term, we could convert this into Python or some kind of a template language e.g. https://github.com/kainjow/Mustache

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, rewriting this in another programming language is fine. It’s rarely run anyway.

This contribution should resolve the following issues: #5, #6, #7, #11, #12, #13, #15, #17, #18, #19
This contribution is also related to the following issues without fully resolving the issues: 3, 4, 8, 10, 21, 23, 24, 25
This contribution also has an implementation that addresses these CLDR issues: 13025, 13563
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants