Programming language: Haskell
License: MIT License
punkt alternatives and similar packages
Based on the "Natural Language Processing" category.
- concraft-pl: A morphosyntactic tagger for Polish based on conditional random fields
- concraft: A morphosyntactic disambiguation library based on constrained conditional random fields
- minimorph: English spelling functions with an emphasis on simplicity. Originally by https://github.com/kowey.
- hist-pl: Programs and libraries related to the historical dictionary of Polish
- polh-lexicon: Programs and libraries related to the historical dictionary of Polish
- sentiwordnet-parser: Parser for the SentiWordNet (http://sentiwordnet.isti.cnr.it/) tab-separated file
- crf-chain2-tiers: Second-order, tiered, constrained, linear conditional random fields
- concraft-hr: A part-of-speech tagger for Croatian based on the concraft library
- penntreebank-megaparsec: Megaparsec parsers for trees in the Penn Treebank format
README
punkt
Multilingual unsupervised sentence tokenization with Punkt.
Usage
Note that abbreviations are detected at run time without the aid of a pre-built abbreviation list:
import Data.Text (Text, pack)
import NLP.Punkt (split_sentences)

corpus :: Text
corpus = pack "Look, Ma! The quick brown Mr. T. rex swallowed the lazy dog. \
              \It really did!"

main :: IO ()
main = mapM_ print (split_sentences corpus)
yields:
"Look, Ma!"
"The quick brown Mr. T. rex swallowed the lazy dog."
"It really did!"
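To give a rough intuition for how abbreviations might be found at run time without a pre-built list, here is a toy sketch. It is not the code inside NLP.Punkt and it does not implement the statistical test from the Punkt paper; it only counts how often each word type appears with and without a trailing period and keeps short types that always carry one. The names periodCounts and abbreviationCandidates, and the thresholds (at least two occurrences, at most four letters), are assumptions made up for this illustration.

```haskell
import Data.Char (toLower)
import qualified Data.Map.Strict as M

-- Toy illustration only: NOT the algorithm used by NLP.Punkt.
-- All names and thresholds below are hypothetical.

-- | (times seen with a trailing period, times seen without one)
type Counts = (Int, Int)

-- | For every lowercased word type, count how often it ends in a period.
periodCounts :: String -> M.Map String Counts
periodCounts = foldr bump M.empty . words
  where
    bump w = M.insertWith add (key w) (if endsInPeriod w then (1, 0) else (0, 1))
    add (a, b) (c, d) = (a + c, b + d)
    key w = map toLower (if endsInPeriod w then init w else w)
    endsInPeriod w = not (null w) && last w == '.'

-- | Short word types seen at least twice and never without a period.
abbreviationCandidates :: String -> [String]
abbreviationCandidates text =
  [ w
  | (w, (withDot, withoutDot)) <- M.toList (periodCounts text)
  , not (null w)
  , withDot >= 2
  , withoutDot == 0
  , length w <= 4
  ]

main :: IO ()
main = print (abbreviationCandidates sample)
  where
    sample = "Dr. Smith met Dr. Jones at noon. Jones left at noon today."
-- prints ["dr"]
```

In the sample, "noon" is rejected because it also occurs without a period, and "today" because it is seen with a period only once. A real implementation weighs such evidence statistically instead of using hard cut-offs.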
References
Kiss, Tibor, and Jan Strunk. "Unsupervised multilingual sentence boundary detection." Computational Linguistics 32.4 (2006): 485-525.
TODO
- parallelize
- modularize tokenization
- custom tokenization rules
- needs to go faster