Vocabulary acquisition datasets

This repository contains a Snakemake workflow that prepares various datasets relevant to modelling vocabulary, with an eye towards modelling receptive vocabulary inventories. It can also enrich the data. The outputs are DuckDB and Arrow tables, which can be used comfortably from Python or R.

Vocabulary inventory datasets

These datasets

L1Ls = 1st language learners; L2Ls = 2nd language learners

Name	Target language	Type	Words	Participants	Availability
SVL12K	English	Self-assessed 5-point scale	12 000 from the SVL wordlist	16 L2Ls based in Japan	Personal website
EVKD1	English	Multiple choice (4) definitions from word in context	100 from the XXX vocabulary size test	100 L2Ls mainly based in Japan	Personal website (currently broken; direct request via email)
TestYourVocab	English	Self-assessed yes/no	~90-160 per participant from bank of 616	>1 627 968 L1Ls, >5 772 534 L2Ls from around the world	Direct request via email
ECP (English Crowdsourcing Project)	English	Lexical decision	~300-1000 per participant from bank of 62 000	700 000	Repository
ELP (English Lexicon Project)	English	Lexical decision	TODO	TODO	Website; [Tool (3rd party)](https://github.com/JackEdTaylor/read-elp); Repository
BLP (British Lexicon Project)	English	Lexical decision	TODO	TODO	Departmental website
FLP (French Lexicon Project)	French	Lexical decision	TODO	TODO	Repository
DCP (Dutch Crowdsourcing Project)	Dutch	Lexical decision	TODO	TODO	Repository
DLP (Dutch Lexicon Project)	Dutch	Lexical decision	TODO	TODO	Repository
DLP2 (Dutch Lexicon Project 2)	Dutch	Lexical decision	TODO	TODO	Repository
SPALEX	Spanish	Lexical decision	TODO	TODO	Repository
WordBank (Collection of many studies)	Multiple	Multiple	MacArthur-Bates Communicative Development Inventory (MB-CDI)	Children; TODO	Departmental website (accessed through public MySQL database, same as the wordbankr package)

Relevant word features

These datasets include features of words which are highly relevant to vocabulary inventory modelling. The most important, beyond frequency (which is not covered here), are age of acquisition and concreteness.

Candidates for addition

Individual datasets

Name	Link	Comment
X	https://www.iris-database.org/iris/app/home/detail?id=york%3a938002&ref=search
X	https://www.iris-database.org/iris/app/home/detail?id=york%3a939292&ref=search
X	https://www.iris-database.org/iris/app/home/detail?id=york%3a852665&ref=search

Lists

List of word knowledge megastudies on the webpages of the Center for Reading Research at Ghent University
Iris database
Chase references from Milton, J. (2009). Measuring Second Language Vocabulary Acquisition.

Cannot find trial level data

frankier/vocabaqdata
