Monthly Downloads: 10
Programming language: Haskell
License: MIT License
punkt alternatives and similar packages
Based on the "Natural Language Processing" category.
- concraft-pl: A morphosyntactic tagger for Polish based on conditional random fields
- concraft: A morphosyntactic disambiguation library based on constrained conditional random fields
- minimorph: English spelling functions with an emphasis on simplicity. Originally by https://github.com/kowey.
- hist-pl: Programs and libraries related to the historical dictionary of Polish
- polh-lexicon: Programs and libraries related to the historical dictionary of Polish
- sentiwordnet-parser: Parser for the [SentiWordNet](http://sentiwordnet.isti.cnr.it/) tab-separated file
- crf-chain2-tiers: Second-order, tiered, constrained, linear conditional random fields
- concraft-hr: A part-of-speech tagger for Croatian based on the concraft library.
- penntreebank-megaparsec: Megaparsec parsers for trees in the Penn Treebank format
README
punkt
Multilingual unsupervised sentence tokenization with Punkt.
Usage
Note that abbreviations are detected at run time without the aid of a pre-built abbreviation list:
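    import Data.Text (Text, pack)
    import NLP.Punkt (split_sentences)

    -- Sample input containing periods ("Mr.", "T.") that do not end a sentence.
    corpus :: Text
    corpus = pack "Look, Ma! The quick brown Mr. T. rex swallowed the lazy dog. \
                  \It really did!"

    -- Print each detected sentence on its own line.
    main :: IO ()
    main = mapM_ print (split_sentences corpus)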
yields:
"Look, Ma!"
"The quick brown Mr. T. rex swallowed the lazy dog."
"It really did!"
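For longer inputs it may be more convenient to read the text from standard input than to embed it in the source. The following is a minimal sketch under one assumption: that split_sentences returns a list of Text values, as the printed output above suggests. The helper numbered is a hypothetical name introduced here for illustration and is not part of the library.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import NLP.Punkt (split_sentences)

-- Hypothetical helper (not part of punkt): prefix each sentence with a
-- 1-based index and a tab character.
numbered :: [T.Text] -> [T.Text]
numbered = zipWith (\i s -> T.pack (show i) <> "\t" <> s) [1 :: Int ..]

main :: IO ()
main = do
  -- Read the whole corpus from stdin; assumes split_sentences :: Text -> [Text].
  input <- TIO.getContents
  mapM_ TIO.putStrLn (numbered (split_sentences input))
```

Compiled into an executable, this could be run as a filter over a text file, with one numbered sentence per output line.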
References
Kiss, Tibor, and Jan Strunk. "Unsupervised multilingual sentence boundary detection." Computational Linguistics 32.4 (2006): 485-525.
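The idea behind detecting abbreviations without a pre-built list can be illustrated with a toy heuristic: a word that is short and always appears with a trailing period is probably an abbreviation. The sketch below is not the statistic used in the paper (which relies on likelihood ratios) and is not how this library is implemented. The names tally and likelyAbbrevs and the thresholds are assumptions made for illustration only.

```haskell
import Data.Char (toLower)
import qualified Data.Map.Strict as M

-- Counts for one word type: (seen with a trailing period, seen without one).
type Counts = (Int, Int)

-- Tally every whitespace-separated token under its lowercased form,
-- with any single trailing period removed.
tally :: String -> M.Map String Counts
tally = foldr step M.empty . words
  where
    step tok = M.insertWith add (key tok) (if endsWithDot tok then (1, 0) else (0, 1))
    add (a, b) (c, d) = (a + c, b + d)
    endsWithDot tok = not (null tok) && last tok == '.'
    stripDot tok = if endsWithDot tok then init tok else tok
    key = map toLower . stripDot

-- Toy criterion (an assumption, not the Kiss-Strunk statistic): a type is a
-- likely abbreviation if it has at most 4 letters, occurs at least twice,
-- and never occurs without its trailing period.
likelyAbbrevs :: String -> [String]
likelyAbbrevs txt =
  [ w
  | (w, (withDot, withoutDot)) <- M.toList (tally txt)
  , withDot >= 2
  , withoutDot == 0
  , length w <= 4
  ]

main :: IO ()
main = mapM_ putStrLn (likelyAbbrevs sample)
  where
    sample = "Dr. Lee met Mr. Kim. Mr. Kim thanked Dr. Lee. Then Kim left."
```

On the sample text this flags "dr" and "mr" but not "kim" or "lee", because those names also occur without a period. A real implementation needs a more robust test, since a word that happens to appear only at sentence ends would fool this heuristic.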
TODO
- parallelize
- modularize tokenization
- custom tokenization rules
- needs to go fasterer