yohannesb/English-Amharic-parallel-corpus


English-Amharic-parallel-corpus

This repository contains an English-Amharic parallel corpus. We used several tools and techniques to collect it. The main tools were HTTrack and Heritrix, which we used to crawl and archive websites and news blogs. Additionally, we downloaded a considerable number of legal documents from different sources. Finally, we extracted the parallel-aligned text from the collected raw data and merged it into a single UTF-8 file for each language. So far, we have collected a total of 225,304 sentences for each language.
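As a rough illustration of how the two merged files might be consumed, the following Python sketch reads them side by side and yields sentence pairs. It assumes that each file holds one sentence per line and that line N in the English file corresponds to line N in the Amharic file; the README does not state this layout explicitly. The function name `read_parallel` and the file names in the usage comment are hypothetical placeholders, not names taken from this repository.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_parallel(en_path: str, am_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (english, amharic) pairs from two UTF-8 files.

    Assumption: the files are line-aligned, one sentence per line.
    """
    with open(en_path, encoding="utf-8") as en_file, \
         open(am_path, encoding="utf-8") as am_file:
        # zip_longest pads the shorter file with None, which lets us
        # detect a length mismatch instead of silently truncating.
        pairs = zip_longest(en_file, am_file)
        for line_no, (en, am) in enumerate(pairs, start=1):
            if en is None or am is None:
                raise ValueError(
                    f"files differ in length at line {line_no}"
                )
            yield en.rstrip("\n"), am.rstrip("\n")


# Hypothetical usage (replace the paths with the actual corpus files):
# for en, am in read_parallel("corpus.en", "corpus.am"):
#     print(en, "|||", am)
```

Raising an error on a length mismatch is a deliberate choice in this sketch: a missing line in a parallel corpus would shift every following pair out of alignment, so failing early is safer than continuing.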
