word Resolution Algorithm: Difference between revisions

From Lojban
Jump to navigation Jump to search
mNo edit summary
 
(→‎Discussion: add counterexmple to regularity of the morphology)
 
(13 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{jbocre/en}}
See the [[BPFK|BPFK]]'s [http://www.lojban.org/phpbb/viewforum.php?f=72 Morphology Forum]


.i a'o le fu'ivla pu'o basti di'e poi gismu (basti: x1 replaces/substitutes for/instead of x2...) e'o ko ciksi tu'a le do jufra (to
I think I could not spent one day more without creating a wiki page for this issue.
{{vajni|While Lojban has a very precise, strict, easy-to-parse, thus unambiguous ''grammar'', it has <u>not</u> been proved ''syntactically'' unambiguous.}}
Let me rephrase this statement.


di'e na'e zasti gismu toi)
If we consider the Lojban language in a purely formal sense, involving an '''alphabet''' composed of its ''words'', supposedly classified into ''lexemes'', then <u>any finite ''string''</u> of lojban (stream of ''words'') can be proven either grammatically correct or not in a finite amount of time using a Turing machine.


''i mi se slabu le terbri be zo basti i su'o jbopli cu zmanei le catni nalselzanru gismu (to le gismu cu zasti i na se liste le catni gimste vau po'o toi) i mo'a da kulnu gismu ije le se klani be loi gismu tarmi cu banzu le nu finti lo selsumji gismu i mi na [[jbocre: smujmi|smujmi]] le do jufra i do cusku no da poi se lidne zo di'e''
This is what the Lojban grammar is for: it defines axioms and inference rules between its words (more precisely its ''lexemes'') in a form that can be fed into a [[LALR|LALR]](1) parser. And because the [[LALR|LALR]](1) parsing algorithm can be implemented using a [[Turing Machine|Turing Machine]], and that it can tell, within a finite amount of time, whether a (finite) string of lexemes follow its grammar or not, we have the result stated above.


le se lidne be zo di'e pupu zasti pu lenu do .o lo drata jbopli cu setca pa jufra
Now, what does that mean ?


''i ko tcidu la'e zoi zoi [http://groups.yahoo.com/group/lojban/message/2137] zoi''
It means that given any stream of Lojban text, if we are able to distinguish its words and categorize them into lexemes, then (roughly) we can parse the text using a computer; however it actually ''presupposes'' that we know a way to distinguish the words and categorize them.


zo gugrnorge basti zo norgo
And so ?


zo gugrtalia basti zo talno
Two facts:
* To this day, the way to distinguish the words in a lojban stream of text is very, very poorly documented (from a computer programmer's point of view). The only rules are given in the [[refgram|refgram]], are vague, and <u>were never proven unambiguous</u>.
* The most precise lojban lexer to this day, the morph algorithm of [[jbofi'e|jbofi'e]], has not been proved complete nor unambiguous, while already beginning to be hugely complex. IIRC, the more a system is complex, the less it is credible to the human mind.


zo gugrturki,e basti zo turko
==== Discussion ====


zo gugrxrvatska basti zo xorvo
As far as I can understand in the morphology of lojban, a complete lexer simply cannot be written using a [[LR|LR]](k) nor [[LL|LL]](k) grammar, for various reasons related to backtracking and the relation between semantics and morphology.


zo kri'ibe basti zo kribo
* What does anything have to do with semantics? brkwords.txt will correctly break up '''djan.polake'ebrIvlaboinoigIsmumu''' just as well as something grammatical.
 
** However, we still have a hard time with "su" and "sa", even "si" when it is used to erase a morphologically wrong piece of text. Not to mention that it is possible that the morphology in itself is possible, while altogether being unambiguous when considered together with the grammar (like an ambiguity that cannot happen because the ambiguous lexemes are not allowed in such or such place by the grammar).
''.i pu tcidu .iku'i sruma le du'u do na nelci lo nalselzanru gismu .isemu'ibo do djica lenu basti ra kei lenu pilno lo catni fu'ivla po'o (to pe'i le su'o jbopli poi zmanei le catni nalselzanru gismu ku'o cu me na'ebo do gi'e me ma zo'o toi)''
* The attempt in [[jbofi'e|jbofi'e]] to solve the problem using [[NFA|NFA]]'s is interesting, but leads to a huge graph of automata, which is very difficult to prove right.
 
* Elrond:
i mi je'a nelci le'e catni nalselzanru gismu i ki'u ma do sruma i mi cusku le se du'u mi pacna le nu le catni nalselzanru gismu cu basti le fu'ivla
*: While continuing to search for an elegant algorithmic solution to this issue, I nonetheless believe that real parsing of Lojban using computers (one which involves a real lexer) is a very difficult task unlikely to be done right soon.
 
* One should see the file [http://www.lojban.org/files/software/BRKWORDS.TXT BRKWORDS.TXT] for the official LLG morphology algorithm.
''cu '''se''' basti le fu'ivla''
** Definitely not official!
 
*** It is what the official LLG parser uses, isn't it?
i a'unai ienaicai i paunai le nu jimpe ma cu nandu i mi zmanei le gismu poi na'e se zanru le catni ku'o le fu'ivla gi'e pacna le du'u le gismu cu basti le fu'ivla to mi zmanei gy toi i mi jimpe le du'u do na tugni mi
**** No. As noted elsewhere, the official parser can only segment compound cmavo, and otherwise relies on spaces/dots (which it considers the same).
 
** It seems to work quite well, but it could probably be polished up a little on type 4 fu'ivla.
[[jbocre: .i tu'e sei cnino cusku .i mi co'a nelci .ueru'e lei fu'ivla bele'a li vo be'o noi se cmima ji'a lei nalselzanru gismu .iku'i ba'e na go'i va'o lenu vo'e se pilno secau le ve ciksi .isa'unai mi tcidu da'i lo se karni .i le finti cu pilno .i'enai zo tci'ile .a zo spero .i le se go'i na catni lojbo valsi gi'e la'a cfipu mi .i .e'u le finti cusku .i'e lu le gugdrtcile cei tci'ile li'u gi'e punaijeba pilno zo tci'ile .i zo cei cu je'a se pilno xamgu pe'i .i lo'e finti cu .e'a ka'e galfi le bangu ba'epe le se finti be ri .i tu'u|.i tu'e sei cnino cusku .i mi co'a nelci .ueru'e lei fu'ivla bele'a li vo be'o noi se cmima ji'a lei nalselzanru gismu .iku'i ba'e na go'i va'o lenu vo'e se pilno secau le ve ciksi .isa'unai mi tcidu da'i lo se karni .i le finti cu pilno .i'enai zo tci'ile .a zo spero .i le se go'i na catni lojbo valsi gi'e la'a cfipu mi .i .e'u le finti cusku .i'e lu le gugdrtcile cei tci'ile li'u gi'e punaijeba pilno zo tci'ile .i zo cei cu je'a se pilno xamgu pe'i .i lo'e finti cu .e'a ka'e galfi le bangu ba'epe le se finti be ri .i tu'u]]
* [[Jay Kominek|Jay]]:
 
*: At one point I put some thought into trying to write a [[context free grammars|context free grammar]] describing the morphology. (Not a context-free one, however. I was intending to use the grammar in conjunction with a generalized parser, like [[General Parsing|Tomita]]'s or [[General Parsing|Early]]'s). Didn't get anywhere. I think that if an [[NFA|NFA]] can parse it, there should definitely be a [[CFG|CFG]] which can describe it, however. ([[CFG|(CFG]]s being more powerful language-recognition mechanisms than automata.)
----
** [[Jay Kominek|Jay Kominek]]:
 
**: Btw, parsing, even with an [[NFA|NFA]] is Really Really fast. So even if the process is ugly, then just as long as it works unambiguously, you're all set.
'''tutrnsapmi? (Lappland)'''  ''e'o ko ciksi lenu basti zo tutrnsami .iki'uma zo tutrnsapmi xagmau gi'a dramau''
** [[User:4D enthusiast|4D enthusiast]] ([[User talk:4D enthusiast|talk]]) 07:41, 27 December 2016 (PST)
 
**: An NFA won't work e.g. zoi.{ab}^n.q.{ab}^m.la.{ab}^p is valid for some p iff n ≠ m, so it's not regular. A slightly more complicated example would show that it's not even context-free.
ni'o vi'o  .i so'a le lojbo kulnu valsi cu se jicmu le cmene be le tutra .enai le kulnu  .i zoi sy S�pmi sy kulnrsapmi cmene le tutra .ije zoi sy S�mi sy kulnrsapmi cmene le kulnu
* Okay. I read thoroughly the brkwords.txt file. The algorithm described there can be coded with a "classic" programming language, which means (according to Church's law) that a Turing machine, or a possibly infinite-state automaton, can morphologically parse (using this algorithm) any lojban text.
 
** However, there still remain two problems:
''je'e .i cinri gi'e xamgu''
** First, we are still not sure that brkwords.txt describes the reverse of the word formation rules described in the refgram. In other words, it is not sure that is is exactly the reverse of the WordFormationAlgorithm. If it is, all is well (we have an algorithm which given a finite-sized word can classify it into a unique lexeme). If it is not (proving it or not), it won't mean anything, because it is not yet proven impossible to invert the formation algorithm by some other way.
 
** In the case where brkwords.txt describes the opposite of the formation algorithm (which I doubt, but...), it is very unsatisfying. In particular, running this algorithm requires parsing the text in both directions from any "important" point. More practically, it requires the reader (be it computer or human) to memorize the whole of the text (when it is not cmavo-prefixed) upto the first pause (which may occur at an arbitrary point of time) before starting to parse. Although this is allowable for computers, I find this constraint very cumbersome as a human listener.
----
** At any rate, I would far prefer to invert the word formation algorithm.
 
*** All very well, except that there is no WFA for [[type 4 fu'ivla]].
mi (to xuzo'o mi darsi lenu cusku lesedu'u mi'e [[User:Nick Nicholas|nitcion]]. toi) jinvi ledu'u zo madiara na mapti zo ckiperi,a
**** Well, we have a general rule to recognize brivla, and morphology rules for every other type of brivla. Therefore a parser which can recognize a brivla will recognize type 4 fu'ivla using its "default" brivla rule.
 
* [[Jay Kominek]]:
.i le re valsi cu pilno lo sance pe le mokla gapru (to zoigy. palatals gy. toi) .i bilga lenu jdice lenu roroi pilno lo mokla tirxe (to zoigy. velar gy. toi) jonai crane (to zoigy. alveolar gy. toi)
*: It does not mean that a listener has to remember the string all the way up until the first pause. Humans are capable of performing lexical, syntactic and semantic processing of language simultaneously. As the lexical part of your mind tries to split up words, the syntactic part is looking to see if the series of words is valid, while the semantic part is looking to see if it makes sense. You'll discard parses which aren't syntactically or semantically valid. Further, there are parsing algorithms which won't be tripped up by bidirectionality of processing, so a computer isn't going to have a very big problem with it, <u>just as long as it isn't ambiguous</u>.
 
.i seni'ibo xamgu fa lenu pilno zo madiara .ecabo zo ctiperi,a .ije xamgu fa lenu pilno zo magiara .ecabo zo ckiperi,a .i ku'i na xamgu fa lenu zo madiara joi zo ckiperi,a se vasru pa liste
 
''.i mi'e xod .i .ienai .i pe'i zo gugrmadjara cu mapti zoi gy. mydiyr .gy .i ku'i zo ckiperi,a cu mapti zoi gy. ckiptar .gy .iseni'ibo le fu'ivla cu sance mapti .i pe'i la'e di'u rairvai''
 
'''mi'e [[User:Nick Nicholas|nitcion]]. .i pe'i do na jimpe le du'u le ca tcini na sance mapti .i mi krefu ciksi'''
 
'''.i ca'i la'ogy. F gy. krati le la'ogy. IPA gy. sinxe pe le mokli gapru voska dicrysance (to zoigy. palatal voiced stop --- boy, is Lojbab ever going to have farm lujvo here gy. toi) '''
 
'''.i la lojban. cu vasru le mokla tirxe jonai crane dicrysance .i na vasru le mokla gapru sance zoi cartu'''
 
Stops (IPA; Lojban equivalents in {}):
 
p {p}  t {t}  c {--}  k {k}  q {--}
 
b {b}  d {d}  F {--}  g {g}  G {--}
 
'''cartu .i zoigy. Shqipe"ria gy. cu sance dunli zoigy. S''c''ipEria gy. .enai zoigy. S''k''ipEria gy. .i zoixy. Magyar xy. cu sance dunli zoigy. maFar gy. .enai zoigy. madiar. gy.'''
 
'''.i gonai ky fa'u ky.ibu fa'u ty.ibu vu'o pe la lojban. krati kybu pe zoigy. Shqipe"ria gy.'''
 
'''gi gy fa'u gy.ibu fa'u dy.ibu vu'o pe la lojban. krati gy.y'ybu pe zoixy. Magyar xy.'''
 
'''.i cabna tcini fa lenu ge da jdice lenu pilno ky. .eseni'iboda'i gy. --- gi de jdice lenu pilno ty.ibu da'i .eni'ibo dy.ibu'''
 
'''.i lenu go'i cu rinka leka le re valsi na simymapti leka sance krati keikei poi rinka lenu cfipu'''
 
'''.imu'ubo mi tavla fi la'ogy. Gjirokaste"r gy. noi tcadu vi le gugdrckiiperi,a .i zoixy. gj xy. pe le bangrckiiperi,a cu ba'e sance mintu zoixy. gy xy. pe le bangrmagiara .i mi pilno .ei zo giirokastyr. ji zo diirokastyr.'''
 
'''mu'o'''
 
''mi'e xod .i ko tcidu [http://nuzban.wiw.org/wiki/index.php?gugrmadjara i'e] poi sera'a zoi gy. Hungary .gy''
 
'''.i mi -- ''mi'e [[User:Nick Nicholas|nitcion]].'' -- djuno ledu'u mi la'a co'a jai fanza fai lenu nalcenba .i ku'i pe'isai gonai do tirna zoigy. ''my,diyr'' gy. ca lenu bacru zoigy. ''Magyar'' gy.; gi do tirna zoigy. ''ctiptar'' gy. ca lenu bacru zoigy. ''Shqiptar'' gy. .i ?xu do se slabu su'o kulnrctipyri,a prenu .i cazi na go'i .u'i'''
 
'''.i le kulnu pe la .uikis. za'a na fapro leka simxu nalmapti kei ki'u lesi'o zmanei le kamjmina le kamlebna kei .iepei .i ku'i pe'icai le vi liste cu ba'a jicmu le ba vlacku liste; gi'e seki'ubo .ei zmadu lei drata pagbu be la .uikis pe dei leni se javni'''
 
'''.i la .iVAN cu gunta mi ki'u lenu mi troci lo simsa kei vecu'u la'ogy. [http://balance.wiw.org/~jkominek/lojban/9203/msg00043.html] gy. .i mi ruble jinvi ledu'u lenu gunta cu drani .i lenu stidi loi lojbo krati lerfu be lei vrici sance cu pe'i sarcu .i mi nupre lenu mi ba stidi .i mu'o'''
 
''mi'e xod .i mi na smuni jimpe le do jufra be sera'a lu gonai li'u. .isejubo mi ba'o tirna le mi pendo seni'i le du'u nodada'o krinu na'ebo tu'a zoi gy. mydiyr .gy''
 
''ni'o da poi kulnrckipta cu pendo mi .i mi pu'o terpreti''
 
----
 
=== gugrmerika ===
 
.iku'i cmene naldrani semu'i lo'e '''bragu'ebilma''' .i ca'a se smuni lo bemjoltco tutra
 
''.i fadni srera dada'o .i no da gugde po'u na'ebo la -united states of america. cu selcmene la amerika .i ja'o la amerika. po'o cu smuni zo gugrmerika''
 
-...but '''plz not''' in Lojban!
 
''Please explain this comment.''
 
Is this really necessary - read above! But in short:
 
Should a word like "America", standing for a whole continent, be included into Lojban that's claiming to be a *culturally neutral* language, just because the people of those, say, 49  states are accustomed to sing "God bless *America*"??? This would be arrogant, and the expression of a pretty silly chauvinism (''bragu'ebilma'') toward all other countries of America (a well-known attitude of the US during many past decades, e.g. with regard to Cuba, Guatemala etc.). And, as you quoted above: it's "United States '''of''' America", i.e. a - forgiveness! - small part of the whole!
 
BTW, what is {no da gugde po'u...}? "No X being a country" (shouldn't there at least be a '''poi''' inserted? {no da poi gugde ku'o po'u...})
 
And, if you really wish to name the US by "America", use a '''cmene''' (which really should be written {la .amerikas.}).
 
''le'osai .i a'enai ko zgana lo'u gugr le'u po'u ba'e zo gugde .i xo da gugde zo'u la'o xy. America .xy pagbu le cmene be dada'o .i ju'ocai pamei .i po'o drata pilno la'o xy. America .xy le braplu po'u la amerikas. .i ja'ocai zo daplrmerika cu sinxa le braplu po'u la amerikas. .i ji'a ja'ocai zo gugrmerika cu satci sinxa pa gugde po'o le do bebna cai xusra be tu'a le bragu'ebilma .ionai cu mabla mu'i ma mi pilno le cmene be le gugrmerika va'o dada'o poi ke'a barda likckupau secu'u so'i le fu'ivla be le'a li 3 be'o fi'o selsni so'i le gugde .i xu do jinvi le du'u le gugrmerika cu nadfadni pa'enai mi'e xod''
 
----

Latest revision as of 15:41, 27 December 2016

See the BPFK's Morphology Forum

I think I could not spent one day more without creating a wiki page for this issue.

While Lojban has a very precise, strict, easy-to-parse, thus unambiguous grammar, it has not been proved syntactically unambiguous.

Let me rephrase this statement.

If we consider the Lojban language in a purely formal sense, involving an alphabet composed of its words, supposedly classified into lexemes, then any finite string of lojban (stream of words) can be proven either grammatically correct or not in a finite amount of time using a Turing machine.

This is what the Lojban grammar is for: it defines axioms and inference rules between its words (more precisely its lexemes) in a form that can be fed into a LALR(1) parser. And because the LALR(1) parsing algorithm can be implemented using a Turing Machine, and that it can tell, within a finite amount of time, whether a (finite) string of lexemes follow its grammar or not, we have the result stated above.

Now, what does that mean ?

It means that given any stream of Lojban text, if we are able to distinguish its words and categorize them into lexemes, then (roughly) we can parse the text using a computer; however it actually presupposes that we know a way to distinguish the words and categorize them.

And so ?

Two facts:

  • To this day, the way to distinguish the words in a lojban stream of text is very, very poorly documented (from a computer programmer's point of view). The only rules are given in the refgram, are vague, and were never proven unambiguous.
  • The most precise lojban lexer to this day, the morph algorithm of jbofi'e, has not been proved complete nor unambiguous, while already beginning to be hugely complex. IIRC, the more a system is complex, the less it is credible to the human mind.

Discussion

As far as I can understand in the morphology of lojban, a complete lexer simply cannot be written using a LR(k) nor LL(k) grammar, for various reasons related to backtracking and the relation between semantics and morphology.

  • What does anything have to do with semantics? brkwords.txt will correctly break up djan.polake'ebrIvlaboinoigIsmumu just as well as something grammatical.
    • However, we still have a hard time with "su" and "sa", even "si" when it is used to erase a morphologically wrong piece of text. Not to mention that it is possible that the morphology in itself is possible, while altogether being unambiguous when considered together with the grammar (like an ambiguity that cannot happen because the ambiguous lexemes are not allowed in such or such place by the grammar).
  • The attempt in jbofi'e to solve the problem using NFA's is interesting, but leads to a huge graph of automata, which is very difficult to prove right.
  • Elrond:
    While continuing to search for an elegant algorithmic solution to this issue, I nonetheless believe that real parsing of Lojban using computers (one which involves a real lexer) is a very difficult task unlikely to be done right soon.
  • One should see the file BRKWORDS.TXT for the official LLG morphology algorithm.
    • Definitely not official!
      • It is what the official LLG parser uses, isn't it?
        • No. As noted elsewhere, the official parser can only segment compound cmavo, and otherwise relies on spaces/dots (which it considers the same).
    • It seems to work quite well, but it could probably be polished up a little on type 4 fu'ivla.
  • Jay:
    At one point I put some thought into trying to write a context free grammar describing the morphology. (Not a context-free one, however. I was intending to use the grammar in conjunction with a generalized parser, like Tomita's or Early's). Didn't get anywhere. I think that if an NFA can parse it, there should definitely be a CFG which can describe it, however. ((CFGs being more powerful language-recognition mechanisms than automata.)
    • Jay Kominek:
      Btw, parsing, even with an NFA is Really Really fast. So even if the process is ugly, then just as long as it works unambiguously, you're all set.
    • 4D enthusiast (talk) 07:41, 27 December 2016 (PST)
      An NFA won't work e.g. zoi.{ab}^n.q.{ab}^m.la.{ab}^p is valid for some p iff n ≠ m, so it's not regular. A slightly more complicated example would show that it's not even context-free.
  • Okay. I read thoroughly the brkwords.txt file. The algorithm described there can be coded with a "classic" programming language, which means (according to Church's law) that a Turing machine, or a possibly infinite-state automaton, can morphologically parse (using this algorithm) any lojban text.
    • However, there still remain two problems:
    • First, we are still not sure that brkwords.txt describes the reverse of the word formation rules described in the refgram. In other words, it is not sure that is is exactly the reverse of the WordFormationAlgorithm. If it is, all is well (we have an algorithm which given a finite-sized word can classify it into a unique lexeme). If it is not (proving it or not), it won't mean anything, because it is not yet proven impossible to invert the formation algorithm by some other way.
    • In the case where brkwords.txt describes the opposite of the formation algorithm (which I doubt, but...), it is very unsatisfying. In particular, running this algorithm requires parsing the text in both directions from any "important" point. More practically, it requires the reader (be it computer or human) to memorize the whole of the text (when it is not cmavo-prefixed) upto the first pause (which may occur at an arbitrary point of time) before starting to parse. Although this is allowable for computers, I find this constraint very cumbersome as a human listener.
    • At any rate, I would far prefer to invert the word formation algorithm.
      • All very well, except that there is no WFA for type 4 fu'ivla.
        • Well, we have a general rule to recognize brivla, and morphology rules for every other type of brivla. Therefore a parser which can recognize a brivla will recognize type 4 fu'ivla using its "default" brivla rule.
  • Jay Kominek:
    It does not mean that a listener has to remember the string all the way up until the first pause. Humans are capable of performing lexical, syntactic and semantic processing of language simultaneously. As the lexical part of your mind tries to split up words, the syntactic part is looking to see if the series of words is valid, while the semantic part is looking to see if it makes sense. You'll discard parses which aren't syntactically or semantically valid. Further, there are parsing algorithms which won't be tripped up by bidirectionality of processing, so a computer isn't going to have a very big problem with it, just as long as it isn't ambiguous.