korpora zei sisku
What is la korpora zei sisku?
la korpora zei sisku is a new Lojban corpus search system created by la danr (Dan Rosén). The system is hosted by la durka.
What are the advantages over the previous corpus search?
- Every sentence is first run through good old jbofi'e, and terbri information is extracted from the successful parses. When a parse fails, cmafi'e is used instead for word segmentation and selma'o tagging.
- It searches not only the registered texts and IRC logs, but also the older mailing list archives, the pages of the Lojban tiki that appear to be written in Lojban, jboselkei, and Tatoeba.
- (Note: the IRC log itself appears to have been inactive since 2015-06-02, so IRC text from the inactive period may not have been added to the corpus.)
- The previous corpus search began to be hit by spam around 2014. Moreover, the move of the Lojban main page from tiki to MediaWiki in March 2015 broke the automatic extraction of texts from lojban.org pages. la guskant and la danr rescued all the endangered data before 2015-06-15, and it is now available through la korpora zei sisku.
- The previous corpus list shows the full texts in the corpus, which may violate copyright law. la korpora zei sisku, on the other hand, shows search results only as short pieces of text accompanied by a URL to the full text, which is clearly legal.
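The excerpt-plus-URL idea in the last point can be illustrated with a minimal Python sketch. This is not the actual code of la korpora zei sisku; the function name `snippets`, the `width` parameter, and the example URL are assumptions made only for illustration.

```python
def snippets(text, query, url, width=30):
    """Return short excerpts of `text` around each match of `query`.

    Hypothetical helper: each excerpt keeps at most `width` characters of
    context on either side of the match and is paired with `url`, so the
    full text itself is never returned.
    """
    if not query:
        return []  # an empty query would otherwise match everywhere
    results = []
    start = 0
    while True:
        hit = text.find(query, start)
        if hit == -1:
            break
        lo = max(0, hit - width)
        hi = min(len(text), hit + len(query) + width)
        results.append({"excerpt": text[lo:hi], "url": url})
        start = hit + len(query)  # continue searching after this match
    return results


# Example usage with a placeholder URL (not a real corpus address).
sample = "coi rodo mi klama le zarci .i mi nelci le zarci"
for item in snippets(sample, "zarci", "https://example.org/full-text", width=5):
    print(item["excerpt"], "->", item["url"])
```

Keeping the context window small means a reader sees only enough to judge the match and must follow the URL for the rest.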
How can I add my new text to ralju korpora?
You need a GitHub account. Clone the repository for the corpus, then add your new text, in XML format, to the jbokorp/corpus_import/ralju/original folder. See the other texts there for the tags and attributes to use. After you add and commit the file, send a pull request. If you are not familiar with GitHub, ask any other jbopli on GitHub for help. For example, la guskant will be glad to help add your text to the corpus.
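Before committing, it may help to confirm that the new file is at least well-formed XML. The following is a minimal Python sketch using only the standard library; the helper name `is_well_formed` and the tag names in the example are placeholders, not the corpus's real schema, so the existing texts in the folder remain the reference for the correct tags and attributes.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if `xml_text` parses as XML, False otherwise.

    Hypothetical helper: this checks well-formedness only. It does not
    check that the tags and attributes match what the corpus expects.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Placeholder tag names, chosen only for illustration.
print(is_well_formed("<text><s>coi rodo</s></text>"))
print(is_well_formed("<text><s>coi rodo</text>"))
```

The first call has balanced tags, while the second leaves the inner element unclosed, which is the kind of mistake this check is meant to catch before a pull request is sent.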
To be considered
- Lojbab's old archives
- Reviving the irclogs