
RubyLexer 0.8.0 Released

RubyLexer version 0.8.0 has been released!

RubyLexer is a lexer library for Ruby, written in Ruby. It is meant to be complete and correct: all legal Ruby code should be lexed correctly by RubyLexer. Just enough parsing capability is included to give RubyLexer the context it needs to tokenize correctly in all cases. (This turned out to be more parsing than I had thought or wanted to take on at first.) RubyLexer handles the hard things like complicated strings, the ambiguous nature of some punctuation characters and keywords in Ruby, and distinguishing methods from local variables.
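
To illustrate why distinguishing methods from local variables matters to a lexer, here is a small plain-Ruby sketch. It does not use RubyLexer itself; the names `value`, `as_method_call`, and `as_local_variable` are made up for this example. The same characters, `value -1`, tokenize differently depending on whether `value` is known to be a local variable at that point.

```ruby
# Hypothetical helper method: returns whatever argument it was given.
def value(arg = 100)
  arg
end

def as_method_call
  # No local variable named `value` is in scope, so `-1` is lexed as a
  # unary-minus argument: this is the call value(-1).
  value -1
end

def as_local_variable
  value = 10
  # `value` is now a local variable, so `-` is lexed as binary minus:
  # this is the expression value - 1.
  value -1
end

puts as_method_call     # prints -1
puts as_local_variable  # prints 9
```

A lexer that does not track local variable declarations cannot tell these two cases apart, which is one reason some parsing capability has to live inside the lexer.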

Install as a gem with: gem install rubylexer
Or download the package: gem, tar.gz, tar.xz (checksums: sha1, sha256, sha512)
On github: rubylexer

Changes in this version:
  • 3 major enhancements:
    • new framework for extending the lexer using modules:
      • moved ruby 1.9 lexing logic into a separate module
      • moved most macro-specific lexing logic to a separate module in rubymacros
    • support for non-ascii encoding:
      • support ascii, binary, utf-8, and euc-* encodings in 1.9 mode
      • 1.8 mode allows binary encoding only
      • \uXXXX character escapes in 1.9 mode strings (and char lits), which can turn a string into utf-8 even in non-utf-8 sources
    • support for the encoding line:
      • encoding line comes out as a separate token
      • there's now a ShebangToken as well as the EncodingDeclToken
      • reading of encoding in -K option in shebang line improved
      • utf8 bom overrides all later encoding decls
  • 8 minor improvements:
    • in gemspec, find files relative to __FILE__ instead of pwd
    • there's now a rubylexer binary; works like the old dumptokens.rb
    • improved test coverage generally
    • defend RubyLexer against being defined by anyone else (_ahem_)
    • friendlier inspect
    • using my own definition of whitespace instead of \s
    • api changes to help redparse out:
      • __ keywords get assigned a value
      • added RubyLexer#unshift to force tokens back onto lexer input
  • 33 minor bugfixes:
    • fixed position attributes of tokens in some cases
    • use more noncapturing groups to avoid backref strangeness later
    • leave trailing nl (if any) at end of heredoc on input
    • emit saved-up here bodies before eof
    • emit right num of parens after unary * & after def and before param list
    • escaped newline token shouldn't have nl unless one was seen in input
    • fixed multi-assigns in string inclusions
    • premature eof in obscure places caused inf loop
    • corrected handling for do inside of assignment inside method param list
    • whitespace should never include trailing newline
    • better detection of ! and = at end of identifiers
    • disallow allow newline around :: in module header
    • cr no longer ends comments
    • !, !=, !~ should always be operator tokens, even in 1.8 mode
    • .. and ... should be operator tokens
    • fixes to unlexer:
      • append newline when unlexing here doc, but only if it had none already
      • improve formatting of dumptokens output when str inclusions are present
      • fixed unlexing of char constants when char is space or non-glyph
    • bugfixes in 1.9-mode lexing:
      • don't make multiassign in block params (directly or nested)
      • recognize lvars after ; in method and block param lists
      • recognize lvars in block param list better
      • 1.9 keywords correctly recognized and processed
      • char literals in 1.9 mode are more like strings than numbers now
      • -> is considered an operator rather than value kw now
      • use ImplicitParamListStart/EndToken instead of KwParamListStart/EndToken for ->'s param list
      • the only chars at end which force an ident to be a method are now ?!=
      • recognize lvar after & or * in stabby block param list
    • changes for 1.9 compatibility:
      • eliminating 1.9 warnings generally
      • avoiding Array#to_s in 1.9 (sigh)
      • keep Token#inspect working in 1.9
      • fix CharSet#===
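
As an illustration of the encoding-line rules listed under the major enhancements (shebang -K option, encoding declaration, utf8 bom taking priority), here is a minimal plain-Ruby sketch. It is not RubyLexer's implementation and does not use its API. The method name `detect_source_encoding`, the `KCODE_NAMES` table, and the `"us-ascii"` default are assumptions made for this example, as is the choice to let an encoding declaration win over -K.

```ruby
# Hypothetical sketch only; not taken from RubyLexer.
UTF8_BOM = "\xEF\xBB\xBF".b

# Assumed mapping of -K option letters to encoding names.
KCODE_NAMES = {
  "e" => "euc-jp",
  "s" => "windows-31j",
  "u" => "utf-8",
  "n" => "ascii-8bit"
}.freeze

def detect_source_encoding(source, default = "us-ascii")
  bytes = source.b # work on raw bytes so any input encoding is safe

  # A utf8 bom overrides every later encoding declaration.
  return "utf-8" if bytes.start_with?(UTF8_BOM)

  first, second = bytes.lines.first(2)
  first = first.to_s
  from_shebang = nil

  if first.start_with?("#!")
    # Look for a -K option on the shebang line.
    if (m = first.match(/\s-K([esun])/i))
      from_shebang = KCODE_NAMES[m[1].downcase]
    end
    candidate = second.to_s # the encoding line may follow the shebang
  else
    candidate = first
  end

  # An encoding declaration is a comment containing coding: or coding=
  if (m = candidate.match(/\A\s*#.*?coding[:=]\s*([\w.-]+)/))
    return m[1].downcase
  end

  from_shebang || default
end

puts detect_source_encoding("#!/usr/bin/ruby -Ke\nputs 1\n")
puts detect_source_encoding("# -*- coding: utf-8 -*-\nputs 1\n")
```

A real lexer would also emit the shebang and the encoding declaration as their own tokens rather than just returning a name; this sketch only covers the precedence between the three sources of encoding information.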