case unsensitive search #16

Open
popanz opened this issue May 30, 2017 · 10 comments

Comments

@popanz

popanz commented May 30, 2017

Hi,
I think it would make sense to build in an option for case-insensitive search.

@mkiol
Owner

mkiol commented May 30, 2017

Yes, it totally makes sense. Unfortunately, there is no easy way to implement it. A ZIM file, like Wikipedia itself, has a case-sensitive title index.

The only possible implementation I see right now is to emulate case insensitivity by doing multiple searches. For instance, when you type "case sensitivity", Zimpedia would have to make at least 4 queries: "case sensitivity", "Case sensitivity", "Case Sensitivity", "case Sensitivity". Then the 4 result sets need to be merged, duplicates eliminated, etc.
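
A minimal sketch of that multi-query emulation, written in Python for illustration and not taken from Zimpedia. It assumes a hypothetical `search_titles` callable that stands in for a case-sensitive title lookup; it is not a zimlib API. Only the first letter of each word is varied, as in the example above.

```python
from itertools import product
from typing import Callable, Iterable, List


def case_variants(query: str) -> List[str]:
    """Return the query with every combination of first-letter casing per word."""
    options = []
    for word in query.split():
        lower = word[:1].lower() + word[1:]
        upper = word[:1].upper() + word[1:]
        # dict.fromkeys keeps order and drops the duplicate when the word
        # starts with a character that has no case (e.g. a digit).
        options.append(list(dict.fromkeys([lower, upper])))
    return [" ".join(combo) for combo in product(*options)]


def emulated_search(
    query: str, search_titles: Callable[[str], Iterable[str]]
) -> List[str]:
    """Run one case-sensitive search per variant, then merge and de-duplicate.

    `search_titles` is a hypothetical case-sensitive lookup supplied by the caller.
    """
    seen = set()
    merged = []
    for variant in case_variants(query):
        for title in search_titles(variant):
            if title not in seen:
                seen.add(title)
                merged.append(title)
    return merged
```

Note that the number of queries doubles with every additional word, so a real implementation would probably need to cap the number of variants.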

Here is a fitting quote from a Wikimedia article that summarizes the problem:

Case sensitivity in MediaWiki is both a blessing and a curse

@kelson42

@popanz @mkiol We are currently working at Kiwix to improve, and maybe fully fix, this problem. The solution will be implemented in zimlib and probably also partly in kiwix-lib. kiwix-lib is now a separate git repo, which can easily be reused.

@mkiol
Owner

mkiol commented May 30, 2017

@kelson42 That is great news! Zimpedia uses zimlib, so the implementation in zimlib interests me the most. Is there an issue number that I could follow to stay informed about the implementation status?

@kelson42

The question of introducing a fully case-insensitive suggestion system is still open on my side. For now we are basically trying to generalise the full-text (case-insensitive) search.

@Frenzie

Frenzie commented Feb 3, 2019

I never really noticed this in Kiwix on Android because there the keyboard starts in lowercase (in Kiwix search, not in a general textfield). So maybe there's a simple way to set something like keyboard.autocaps = false for the search field?

On a slightly related note, but perhaps this should be its own FR, it'd be nice if eleve could also find élève. Typing all those accents is a bit of a pain on a phone.

@mkiol
Owner

mkiol commented Feb 5, 2019

So maybe there's a simple way to set something like keyboard.autocaps = false for the search field?
It'd be nice if eleve could also find élève

Unfortunately, it is not so simple right now. Search results retrieved from libzim are case sensitive, so to achieve what you've suggested, several searches with different case and accent variants (élève, elève, éleve, eleve, elevé, etc.) would have to be made. All results would then have to be de-duplicated and merged. It is possible but complicated...

As @kelson42 suggested, maybe you should try full-text search mode (this feature was added in the recent Zimpedia update). It is case insensitive, but unfortunately the results can sometimes be unpredictable.

@Frenzie

Frenzie commented Feb 5, 2019

Those two paragraphs shouldn't be taken together like that. ;-)

I was slightly wrong about present-day Android Kiwix. I suspect it might perform two searches, one lowercase and one uppercase. (But certainly not full text.)

But what I was talking about is that in Android Kiwix, the keyboard opens lowercase, like this. I know Sailfish can do that too, e.g., for the browser address bar.

In Zimpedia, it opens like this:

So if you just start typing a word like bear, you'll unintentionally search for Bear. Then you'll only find, say, Bearbeitung. With the exception of German, I think that defaulting to lowercase would largely resolve the issue without any change to the actual searching code. And presumably that's just a simple flag somewhere.

Kiwix will not, of course, find the word élève by typing [Ee]leve. That's just something I'd like it to do. There are fairly standard Unicode algorithms for that, I believe.
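
One common approach of that kind is to decompose the text with Unicode normalization (NFD), drop the combining marks, and case-fold what remains. The sketch below uses Python's standard `unicodedata` module. The helper names `fold` and `build_folded_index` are made up for illustration, and the approach assumes the titles can be indexed under their folded form, which the ZIM title index discussed here does not do.

```python
import unicodedata
from typing import Dict, Iterable, List


def fold(text: str) -> str:
    """Strip combining marks and case-fold, so 'Élève' and 'eleve' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    # After NFD, an accented letter is a base letter followed by combining
    # marks; unicodedata.combining() is non-zero for those marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def build_folded_index(titles: Iterable[str]) -> Dict[str, List[str]]:
    """Map each folded key to the original titles that share it."""
    index: Dict[str, List[str]] = {}
    for title in titles:
        index.setdefault(fold(title), []).append(title)
    return index
```

With such an index, looking up `fold("eleve")` would return every stored title that differs only in case or accents. Letters without a canonical decomposition, such as ø, are left unchanged by this method.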

@mkiol
Owner

mkiol commented Feb 5, 2019

Thanks for the clarification. I will look into it.

@mkiol
Owner

mkiol commented Feb 6, 2019

In 3ab979e I've added the following search procedure:

  1. first search with upper case first letter
  2. second search with lower case first letter
  3. results are merged and sorted case-insensitively

It works pretty well... but élève case is far more complex...
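
The three-step procedure above could look roughly like the following. This is an illustrative Python sketch, not the code from the commit; `search_titles` is again a hypothetical case-sensitive lookup passed in by the caller.

```python
from typing import Callable, Iterable, List


def first_letter_search(
    query: str, search_titles: Callable[[str], Iterable[str]]
) -> List[str]:
    """Search with upper- and lower-case first letter, merge, sort ignoring case."""
    if not query:
        return []
    upper = query[0].upper() + query[1:]
    lower = query[0].lower() + query[1:]

    # Step 1: search with an upper-case first letter.
    results = set(search_titles(upper))
    # Step 2: search with a lower-case first letter (skipped if identical).
    if lower != upper:
        results.update(search_titles(lower))
    # Step 3: the set has already merged duplicates; sort case-insensitively,
    # using the original title as a tie-breaker for a stable order.
    return sorted(results, key=lambda title: (title.casefold(), title))
```

Typing `bear` would then return both `bear` and `Bear` entries next to each other instead of only the capitalised ones.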

@Frenzie

Frenzie commented Feb 6, 2019

Very nice! 👍

but élève case is far more complex...

Well yeah, diacritics are hard. ;-) I'll have to check what GoldenDict does because you've made me curious. Iirc it performs a variety of clever tricks with diacritics and morphology alike.
