M1 Plurital
clement.plancq@ens.psl.eu
« corpus in modern linguistics, in contrast to being simply any body of text, might more accurately be described as a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration. »
(McEnery and Wilson 2001)
Strict definition of a corpus, inherited from corpus linguistics
In NLP, "corpus" can take a broader definition: a set of text documents
One also speaks of a collection of language data
The boundary between a corpus and a dataset is thin
+---------------+
|   raw text    |
+-------+-------+
        | sentence splitting
+-------v-------+
|   sentences   |
+-------+-------+
        | tokenisation
+-------v-------+
|    tokens     |
+-------+-------+
        | POS tagging
+-------v-------+
| tagged corpus |
+-------+-------+
        | parsing
+-------v-------+
|   treebank    |
+---------------+
One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin. He lay on
his armour-like back, and if he lifted his head a little he could
see his brown belly, slightly domed and divided by arches into stiff
sections.
–
<s>One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.</s> <s> He lay on
his armour-like back, and if he lifted his head a little he could
see his brown belly, slightly domed and divided by arches into stiff
sections.</s>
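The splitting shown above can be approximated with a very naive rule. This is a minimal sketch, assuming that a sentence ends at '.', '!' or '?' followed by whitespace; the function name `split_sentences` is hypothetical and does not come from any tool cited in these slides.

```python
import re

def split_sentences(text):
    """Naive splitter: cut after '.', '!' or '?' when followed by whitespace."""
    # Normalise line breaks so a sentence spanning several lines stays whole.
    flat = " ".join(text.split())
    # The lookbehind keeps the punctuation mark attached to its sentence.
    return re.split(r"(?<=[.!?])\s+", flat)

text = """One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin. He lay on
his armour-like back, and if he lifted his head a little he could
see his brown belly, slightly domed and divided by arches into stiff
sections."""

sentences = split_sentences(text)
# Wrap each sentence in <s>...</s>, as in the example above.
print(" ".join(f"<s>{s}</s>" for s in sentences))
```

Such a rule is fragile: abbreviations, quotation marks and exclamations inside dialogue all produce wrong boundaries.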
"Oh, God!"
called his mother, who was already in tears, "he could be seriously
ill and we're making him suffer. Grete! Grete!" she then cried.
"Mother?" his sister called from the other side. They communicated
across Gregor's room. "You'll have to go for the doctor straight
away. Gregor is ill. Quick, get the doctor. Did you hear the way
Gregor spoke just now?" "That was the voice of an animal", said the
chief clerk, with a calmness that was in contrast with his mother's
screams.
?
One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.
–
One / morning, / when / Gregor / Samsa / woke / from / troubled / dreams, / he / found / himself / transformed / in / his / bed / into / a / horrible / vermin.
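The segmentation above corresponds to splitting on whitespace only. A minimal sketch, assuming nothing beyond Python's built-in `str.split`:

```python
sentence = ("One morning, when Gregor Samsa woke from troubled dreams, he found "
            "himself transformed in his bed into a horrible vermin.")

# Whitespace tokenisation: punctuation stays glued to the preceding word.
tokens = sentence.split()
print(" / ".join(tokens))
```

Note that "morning," and "vermin." keep their punctuation, so they would not be counted as the same word forms as "morning" and "vermin".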
"What's happened to me?" he thought. It wasn't a dream.
–
?
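One possible answer, among several defensible ones, is to separate punctuation and English clitics with a regular expression. This is a sketch under assumptions: splitting "What's" into "What" + "'s" and "wasn't" into "was" + "n't" is a convention chosen here for illustration, and `TOKEN_RE` is a hypothetical name.

```python
import re

TOKEN_RE = re.compile(r"""
    n't            # negative clitic, as in was + n't
  | '\w+           # other clitics introduced by an apostrophe
  | \w+?(?=n't)    # word stem standing right before n't
  | \w+            # plain word
  | [^\w\s]        # any single punctuation mark
""", re.VERBOSE)

text = "\"What's happened to me?\" he thought. It wasn't a dream."
tokens = TOKEN_RE.findall(text)
print(" / ".join(tokens))
```

The pattern treats every apostrophe followed by letters as a clitic, so text that uses single quotes as quotation marks would need a different rule.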
L'homme/NPP/*L'homme était/V/être parti/VPP/partir de/P/de
Marchiennes/NPP/Marchiennes vers/P/vers deux/DET/*deux heures.
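The inline annotation above can be read back into (form, tag, lemma) triples. A minimal sketch, assuming items are separated by whitespace and fields by slashes, as in the example; `parse_tagged` is a hypothetical helper. The leading `*` on some lemmas is left untouched, since its meaning is not stated here.

```python
def parse_tagged(line):
    """Split 'form/POS/lemma' items; items lacking annotations keep None."""
    rows = []
    for item in line.split():
        # Split from the right so that a form containing '/' stays intact.
        parts = item.rsplit("/", 2)
        if len(parts) == 3:
            rows.append(tuple(parts))
        else:
            rows.append((item, None, None))
    return rows

line = ("L'homme/NPP/*L'homme était/V/être parti/VPP/partir de/P/de "
        "Marchiennes/NPP/Marchiennes vers/P/vers deux/DET/*deux heures.")

for form, pos, lemma in parse_tagged(line):
    print(form, pos, lemma, sep="\t")
```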
L'            DET:ART    le
homme         NOM        homme
était         VER:impf   être
parti         VER:pper   partir
de            PRP        de
Marchiennes   NAM        <unknown>
vers          PRP        vers
deux          NUM        deux
heures        NOM        heure
.             SENT       .
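This one-token-per-line layout (form, tag, lemma) is easy to load and count. A minimal sketch, assuming whitespace-separated columns with exactly three fields per line; `read_vertical` is a hypothetical name.

```python
from collections import Counter

tagged = """\
L' DET:ART le
homme NOM homme
était VER:impf être
parti VER:pper partir
de PRP de
Marchiennes NAM <unknown>
vers PRP vers
deux NUM deux
heures NOM heure
. SENT .
"""

def read_vertical(text):
    """Return one (form, pos, lemma) triple per non-empty line."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        form, pos, lemma = line.split()
        rows.append((form, pos, lemma))
    return rows

rows = read_vertical(tagged)
# Coarse POS: keep only the part of the tag before ':' (VER:impf -> VER).
pos_counts = Counter(pos.split(":")[0] for _, pos, _ in rows)
# Forms for which the tagger had no lemma.
unknown = [form for form, _, lemma in rows if lemma == "<unknown>"]
print(pos_counts)
print(unknown)
```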
Description of the CoNLL-2009 and CoNLL-U formats
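As an illustration of CoNLL-U only, each token line carries ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), and lines starting with '#' are comments. The sketch below is a minimal reader for a single sentence; the annotated sentence is hand-written for this example and is not taken from a treebank, and `read_conllu` is a hypothetical name.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Hand-written illustration (assumption, not treebank data); '_' is an empty field.
sample = "\n".join([
    "# text = L'homme était parti.",
    "1\tL'\tle\tDET\t_\t_\t2\tdet\t_\tSpaceAfter=No",
    "2\thomme\thomme\tNOUN\t_\t_\t4\tnsubj\t_\t_",
    "3\tétait\têtre\tAUX\t_\t_\t4\taux\t_\t_",
    "4\tparti\tpartir\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "5\t.\t.\tPUNCT\t_\t_\t4\tpunct\t_\t_",
])

def read_conllu(text):
    """Read one sentence: skip comments and blank lines, map fields by name."""
    sentence = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        sentence.append(dict(zip(FIELDS, line.split("\t"))))
    return sentence

tokens = read_conllu(sample)
for tok in tokens:
    print(tok["id"], tok["form"], tok["upos"], tok["head"], tok["deprel"], sep="\t")
```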
Concordancers: WordSmith, AntConc
Query tools that take annotations into account (IMS CWB, NoSketch Engine)
Query tools for treebanks (TGrep, Tregex, TIGERSearch, ANNIS, Grew)
Tools for aligned corpora (OPUS)
Developed in Stuttgart for IMS CWB. See the documentation.
[word = "Marchiennes"]
[pos = "NAM"]
[] (any word)
[word = ".+eur"]
[word = ".+eur" & pos="N"]
[pos="N"] [pos="ADJ"]
<s> [pos = "V"]
[pos = "V"] </s>
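To make the semantics of these queries concrete, the token-level part can be mimicked in Python. This is only a sketch under assumptions: a query is represented as a list of {attribute: regex} dictionaries (one per token position), each regex must match the whole attribute value, and the structural anchors `<s>` and `</s>` are not handled. The `find` helper is hypothetical and is not part of IMS CWB or NoSketch Engine.

```python
import re

rows = [("L'", "DET:ART", "le"), ("homme", "NOM", "homme"),
        ("était", "VER:impf", "être"), ("parti", "VER:pper", "partir"),
        ("de", "PRP", "de"), ("Marchiennes", "NAM", "<unknown>"),
        ("vers", "PRP", "vers"), ("deux", "NUM", "deux"),
        ("heures", "NOM", "heure"), (".", "SENT", ".")]
tokens = [{"word": w, "pos": p, "lemma": l} for w, p, l in rows]

def find(tokens, query):
    """Return the word forms of every token window matching the query."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        # Every condition of every position must match the whole value.
        if all(re.fullmatch(rx, tok[attr])
               for tok, cond in zip(window, query)
               for attr, rx in cond.items()):
            hits.append([tok["word"] for tok in window])
    return hits

print(find(tokens, [{"word": "Marchiennes"}]))            # like [word = "Marchiennes"]
print(find(tokens, [{"pos": "NAM"}]))                     # like [pos = "NAM"]
print(find(tokens, [{"word": ".+es", "pos": "NOM"}]))     # two conditions joined by &
print(find(tokens, [{"pos": "NUM"}, {"pos": "NOM"}]))     # sequence of two tokens
print(len(find(tokens, [{}])))                            # [] matches any token
```

The tags used here are those of the tagged example sentence, not the simplified "N", "V" and "ADJ" tags of the queries above.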
Querying the frWaC corpus with the NoSketch Engine tool
Find the nouns beginning with 'anti'
Find the words containing two consecutive 'z's
Find the frequencies by POS of the words ending in 'able'
Compare the frequencies of adverbs ending in 'ment' with those of adverbs not ending in 'ment'
Count the number of "adj noun" and "noun adj" sequences