
Monthly Downloads: 7
Programming language: Haskell
License: MIT License


README

punkt

Multilingual unsupervised sentence tokenization with Punkt.

Usage

Note that abbreviations are detected at run time without the aid of a pre-built abbreviation list:

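import Data.Text (Text, pack)
import NLP.Punkt (split_sentences)

-- Sample text containing an abbreviation ("Mr.") and an initial ("T.").
corpus :: Text
corpus = pack "Look, Ma! The quick brown Mr. T. rex swallowed the lazy dog. \
              \It really did!"

-- Print each detected sentence on its own line.
main :: IO ()
main = mapM_ print (split_sentences corpus)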

yields:

"Look, Ma!"
"The quick brown Mr. T. rex swallowed the lazy dog."
"It really did!"
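
The example above splits a string literal. For a real corpus, a minimal sketch along the following lines reads the text from standard input and prints one numbered sentence per line. It relies only on `split_sentences` as used above; `numberSentences` is a hypothetical helper introduced here for illustration and is not part of punkt. Because abbreviations are detected from the text at run time, results on very short inputs may differ from those on a larger corpus.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import NLP.Punkt (split_sentences)

-- Hypothetical helper (not part of punkt): prefix each sentence
-- with its 1-based position, e.g. "1: First sentence."
numberSentences :: [T.Text] -> [T.Text]
numberSentences = zipWith label [1 :: Int ..]
  where
    label n s = T.concat [T.pack (show n), T.pack ": ", s]

main :: IO ()
main = do
  -- Read the whole corpus from standard input.
  corpus <- TIO.getContents
  -- Split into sentences and print one numbered sentence per line.
  mapM_ TIO.putStrLn (numberSentences (split_sentences corpus))
```

Assuming the program is compiled to an executable named `sentences` (a name chosen here for illustration), it could be run as `./sentences < corpus.txt`.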

References

Kiss, Tibor, and Jan Strunk. "Unsupervised multilingual sentence boundary detection." Computational Linguistics 32.4 (2006): 485-525.

TODO

  • parallelize
  • modularize tokenization
    • custom tokenization rules
  • improve performance