This repository contains an English-Amharic parallel corpus. We used several tools and techniques to collect it. The main tools were HTTrack and Heritrix, which we used to crawl and archive various websites and news blogs. Additionally, we downloaded a considerable number of legal documents from different sources. Finally, we extracted the aligned parallel text from the collected raw data and merged it into a single UTF-8 file for each language. So far, we have collected a total of 225,304 sentences for each language.
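Since the corpus is described as one UTF-8 file per language, a loader can pair the two files line by line. The following is a minimal sketch, assuming each file holds one sentence per line and that line N in one file corresponds to line N in the other; the function name `read_parallel` and the file names in the usage note are hypothetical and are not taken from the repository.

```python
def read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(en_path, am_path):
    """Return a list of (english, amharic) pairs, one pair per line.

    Assumes the two files are line-aligned; raises ValueError if the
    line counts differ, since the pairing would then be unreliable.
    """
    en_lines = read_lines(en_path)
    am_lines = read_lines(am_path)
    if len(en_lines) != len(am_lines):
        raise ValueError(
            f"line count mismatch: {len(en_lines)} English vs "
            f"{len(am_lines)} Amharic"
        )
    return list(zip(en_lines, am_lines))
```

Called as `pairs = read_parallel("corpus.en", "corpus.am")` (placeholder file names), `len(pairs)` should equal 225,304 if the files match the description above.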
yohannesb/English-Amharic-parallel-corpus