word Resolution Algorithm: Difference between revisions

From Lojban
Jump to navigation Jump to search
mNo edit summary
 
mNo edit summary
Line 1: Line 1:


.i a'o le fu'ivla pu'o basti di'e poi gismu (basti: x1 replaces/substitutes for/instead of x2...) e'o ko ciksi tu'a le do jufra (to
See the [[BPFK|BPFK]]'s [http://www.lojban.org/phpbb/viewforum.php?f=72 orphology Forum]


di'e na'e zasti gismu toi)
I think I could not spent one day more without creating a wiki page for this issue.


''i mi se slabu le terbri be zo basti i su'o jbopli cu zmanei le catni nalselzanru gismu (to le gismu cu zasti i na se liste le catni gimste vau po'o toi) i mo'a da kulnu gismu ije le se klani be loi gismu tarmi cu banzu le nu finti lo selsumji gismu i mi na [[jbocre: smujmi|smujmi]] le do jufra i do cusku no da poi se lidne zo di'e''
'''While Lojban has a very precise, strict, easy-to-parse, thus unambiguous ''grammar'', it has ''not'' (emphasized) been proved ''syntactically'' unambiguous.'''


le se lidne be zo di'e pupu zasti pu lenu do .o lo drata jbopli cu setca pa jufra
Let me rephrase this statement.


''i ko tcidu la'e zoi zoi [http://groups.yahoo.com/group/lojban/message/2137] zoi''
If we consider the Lojban language in a purely formal sense, involving an '''alphabet''' composed of its '''words''', supposedly classified into '''lexemes''', then '''any finite ''string''''' of lojban (stream of '''words''') can be proven either grammatically correct or not in a finite amount of time using a turing machine.


zo gugrnorge basti zo norgo
This is what the Lojban grammar is for : it defines axioms and inference rules between its words (more precisely its '''lexemes''') in a form that can be fed into a [[jbocre: LALR|LALR]](1) parser. And because the [[jbocre: LALR|LALR]](1) parsing algorithm can be implemented using a [[jbocre: Turing Machine|Turing Machine]], and that it can tell, within a finite amount of time, whether a (finite) string of lexemes follow its grammar or not, we have the result stated above.


zo gugrtalia basti zo talno
Now, what does that mean ?


zo gugrturki,e basti zo turko
It means that given any stream of Lojban text, if we are able to distinguish its words and categorize them into lexemes, then (roughly) we can parse the text using a computer; however it actually ''presupposes'' that we know a way to distinguish the words and categorize them.


zo gugrxrvatska basti zo xorvo
And so ?


zo kri'ibe basti zo kribo
Two facts :


''.i pu tcidu .iku'i sruma le du'u do na nelci lo nalselzanru gismu .isemu'ibo do djica lenu basti ra kei lenu pilno lo catni fu'ivla po'o (to pe'i le su'o jbopli poi zmanei le catni nalselzanru gismu ku'o cu me na'ebo do gi'e me ma zo'o toi)''
* To this day, the way to distinguish the words in a lojban stream of text is very, very poorly documented (from a computer programmer's point of view). The only rules are given in the [[jbocre: refgram|refgram]], are vague, and '''were never proven unambiguous'''.


i mi je'a nelci le'e catni nalselzanru gismu i ki'u ma do sruma i mi cusku le se du'u mi pacna le nu le catni nalselzanru gismu cu basti le fu'ivla
* The most precise lojban lexer to this day, the morph algorithm of [[jbocre: jbofi'e|jbofi'e]], has not been proved complete nor unambiguous, while already beginning to be hugely complex. IIRC, the more a system is complex, the less it is credible to the human mind.


''cu '''se''' basti le fu'ivla''
==== Epilog ====


i a'unai ienaicai i paunai le nu jimpe ma cu nandu i mi zmanei le gismu poi na'e se zanru le catni ku'o le fu'ivla gi'e pacna le du'u le gismu cu basti le fu'ivla to mi zmanei gy toi i mi jimpe le du'u do na tugni mi
As far as I can understand in the morphology of lojban, a complete lexer simply cannot be written using a [[jbocre: LR|LR]](k) nor [[jbocre: LL|LL]](k) grammar, for various reasons related to backtracking and the relation between semantics and morphology.


[[jbocre: .i tu'e sei cnino cusku .i mi co'a nelci .ueru'e lei fu'ivla bele'a li vo be'o noi se cmima ji'a lei nalselzanru gismu .iku'i ba'e na go'i va'o lenu vo'e se pilno secau le ve ciksi .isa'unai mi tcidu da'i lo se karni .i le finti cu pilno .i'enai zo tci'ile .a zo spero .i le se go'i na catni lojbo valsi gi'e la'a cfipu mi .i .e'u le finti cusku .i'e lu le gugdrtcile cei tci'ile li'u gi'e punaijeba pilno zo tci'ile .i zo cei cu je'a se pilno xamgu pe'i .i lo'e finti cu .e'a ka'e galfi le bangu ba'epe le se finti be ri .i tu'u|.i tu'e sei cnino cusku .i mi co'a nelci .ueru'e lei fu'ivla bele'a li vo be'o noi se cmima ji'a lei nalselzanru gismu .iku'i ba'e na go'i va'o lenu vo'e se pilno secau le ve ciksi .isa'unai mi tcidu da'i lo se karni .i le finti cu pilno .i'enai zo tci'ile .a zo spero .i le se go'i na catni lojbo valsi gi'e la'a cfipu mi .i .e'u le finti cusku .i'e lu le gugdrtcile cei tci'ile li'u gi'e punaijeba pilno zo tci'ile .i zo cei cu je'a se pilno xamgu pe'i .i lo'e finti cu .e'a ka'e galfi le bangu ba'epe le se finti be ri .i tu'u]]
''What does anything have to do with semantics? brkwords.txt will correctly break up ''djan.polake'ebrIvlaboinoigIsmumu'' just as well as something grammatical. -- ''However, we still have a hard time with "su" and "sa", even "si" when it is used to erase a morphologically wrong piece of text. Not to mention that it is possible that the morphology in itself is possible, while altogether being unambiguous when considered together with the grammar (like an ambiguity that cannot happen because the ambiguous lexemes are not allowed in such or such place by the grammar).


----
The attempt in [[jbocre: jbofi'e|jbofi'e]] to solve the problem using [[jbocre: NFA|NFA]]'s is interesting, but leads to a huge graph of automata, which is very difficult to prove right.


'''tutrnsapmi? (Lappland)'''  ''e'o ko ciksi lenu basti zo tutrnsami .iki'uma zo tutrnsapmi xagmau gi'a dramau''
While continuing to search for an elegant algorithmic solution to this issue, I nonetheless believe that real parsing of Lojban using computers (one which involves a real lexer) is a very difficult task unlikely to be done right soon.


ni'o vi'o  .i so'a le lojbo kulnu valsi cu se jicmu le cmene be le tutra .enai le kulnu  .i zoi sy S�pmi sy kulnrsapmi cmene le tutra .ije zoi sy S�mi sy kulnrsapmi cmene le kulnu
--[[jbocre: Elrond|Elrond]]
 
''je'e .i cinri gi'e xamgu''


----
----


mi (to xuzo'o mi darsi lenu cusku lesedu'u mi'e [[User:Nick Nicholas|nitcion]]. toi) jinvi ledu'u zo madiara na mapti zo ckiperi,a
One should see the file [http://www.lojban.org/files/software/B BRKWORDS.TXT] for the official LLG morphology algorithm.


.i le re valsi cu pilno lo sance pe le mokla gapru (to zoigy. palatals gy. toi) .i bilga lenu jdice lenu roroi pilno lo mokla tirxe (to zoigy. velar gy. toi) jonai crane (to zoigy. alveolar gy. toi)
* Definitely '''not''' official!
** It is what the official LLG parser uses, isn't it?


.i seni'ibo xamgu fa lenu pilno zo madiara .ecabo zo ctiperi,a .ije xamgu fa lenu pilno zo magiara .ecabo zo ckiperi,a .i ku'i na xamgu fa lenu zo madiara joi zo ckiperi,a se vasru pa liste
*** No. As noted elsewhere, the official parser can only segment compound cmavo, and otherwise relies on spaces/dots (which it considers the same).


''.i mi'e xod .i .ienai .i pe'i zo gugrmadjara cu mapti zoi gy. mydiyr .gy .i ku'i zo ckiperi,a cu mapti zoi gy. ckiptar .gy .iseni'ibo le fu'ivla cu sance mapti .i pe'i la'e di'u rairvai''
''It seems to work quite well, but it could probably be polished up a little on type 4 fu'ivla.''


'''mi'e [[User:Nick Nicholas|nitcion]]. .i pe'i do na jimpe le du'u le ca tcini na sance mapti .i mi krefu ciksi'''
----
 
'''.i ca'i la'ogy. F gy. krati le la'ogy. IPA gy. sinxe pe le mokli gapru voska dicrysance (to zoigy. palatal voiced stop --- boy, is Lojbab ever going to have farm lujvo here gy. toi) '''
 
'''.i la lojban. cu vasru le mokla tirxe jonai crane dicrysance .i na vasru le mokla gapru sance zoi cartu'''
 
Stops (IPA; Lojban equivalents in {}):
 
p {p}  t {t}  c {--}  k {k}  q {--}
 
b {b}  d {d}  F {--}  g {g}  G {--}
 
'''cartu .i zoigy. Shqipe"ria gy. cu sance dunli zoigy. S''c''ipEria gy. .enai zoigy. S''k''ipEria gy. .i zoixy. Magyar xy. cu sance dunli zoigy. maFar gy. .enai zoigy. madiar. gy.'''
 
'''.i gonai ky fa'u ky.ibu fa'u ty.ibu vu'o pe la lojban. krati kybu pe zoigy. Shqipe"ria gy.'''
 
'''gi gy fa'u gy.ibu fa'u dy.ibu vu'o pe la lojban. krati gy.y'ybu pe zoixy. Magyar xy.'''
 
'''.i cabna tcini fa lenu ge da jdice lenu pilno ky. .eseni'iboda'i gy. --- gi de jdice lenu pilno ty.ibu da'i .eni'ibo dy.ibu'''
 
'''.i lenu go'i cu rinka leka le re valsi na simymapti leka sance krati keikei poi rinka lenu cfipu'''
 
'''.imu'ubo mi tavla fi la'ogy. Gjirokaste"r gy. noi tcadu vi le gugdrckiiperi,a .i zoixy. gj xy. pe le bangrckiiperi,a cu ba'e sance mintu zoixy. gy xy. pe le bangrmagiara .i mi pilno .ei zo giirokastyr. ji zo diirokastyr.'''
 
'''mu'o'''
 
''mi'e xod .i ko tcidu [http://nuzban.wiw.org/wiki/index.php?gugrmadjara i'e] poi sera'a zoi gy. Hungary .gy''
 
'''.i mi -- ''mi'e [[User:Nick Nicholas|nitcion]].'' -- djuno ledu'u mi la'a co'a jai fanza fai lenu nalcenba .i ku'i pe'isai gonai do tirna zoigy. ''my,diyr'' gy. ca lenu bacru zoigy. ''Magyar'' gy.; gi do tirna zoigy. ''ctiptar'' gy. ca lenu bacru zoigy. ''Shqiptar'' gy. .i ?xu do se slabu su'o kulnrctipyri,a prenu .i cazi na go'i .u'i'''
 
'''.i le kulnu pe la .uikis. za'a na fapro leka simxu nalmapti kei ki'u lesi'o zmanei le kamjmina le kamlebna kei .iepei .i ku'i pe'icai le vi liste cu ba'a jicmu le ba vlacku liste; gi'e seki'ubo .ei zmadu lei drata pagbu be la .uikis pe dei leni se javni'''
 
'''.i la .iVAN cu gunta mi ki'u lenu mi troci lo simsa kei vecu'u la'ogy. [http://balance.wiw.org/~jkominek/lojban/9203/msg00043.html] gy. .i mi ruble jinvi ledu'u lenu gunta cu drani .i lenu stidi loi lojbo krati lerfu be lei vrici sance cu pe'i sarcu .i mi nupre lenu mi ba stidi .i mu'o'''


''mi'e xod .i mi na smuni jimpe le do jufra be sera'a lu gonai li'u. .isejubo mi ba'o tirna le mi pendo seni'i le du'u nodada'o krinu na'ebo tu'a zoi gy. mydiyr .gy''
At one point I put some thought into trying to write a [[jbocre: Context Free Grammars ontext free grammar|Context Free Grammars ontext free grammar]] describing the morphology. (Not a context-free one, however. I was intending to use the grammar in conjunction with a generalized parser, like [[jbocre: General Parsing omita|General Parsing omita]]'s or [[jbocre: General Parsing arly|General Parsing arly]]'s). Didn't get anywhere. I think that if an [[jbocre: NFA|NFA]] can parse it, there should definitely be a [[jbocre: CFG|CFG]] which can describe it, however. ([[jbocre: CFG|(CFG]]s being more powerful language-recognition mechanisms than automata.) --[[jbocre: Jay Kominek|Jay]]


''ni'o da poi kulnrckipta cu pendo mi .i mi pu'o terpreti''
Btw, parsing, even with an [[jbocre: NFA|NFA]] is Really Really fast. So even if the process is ugly, then just as long as it works unambiguously, you're all set. -[[jbocre: Jay Kominek|Jay Kominek]]


----
----


=== gugrmerika ===
Okay. I read thoroughly the brkwords.txt file. The algorithm described there can be coded with a "classic" programming language, which means (according to Church's law) that a Turing machine, or a possibly infinite-state automaton, can morphologically parse (using this algorithm) any lojban text.
 
.iku'i cmene naldrani semu'i lo'e '''bragu'ebilma''' .i ca'a se smuni lo bemjoltco tutra
 
''.i fadni srera dada'o .i no da gugde po'u na'ebo la -united states of america. cu selcmene la amerika .i ja'o la amerika. po'o cu smuni zo gugrmerika''
 
-...but '''plz not''' in Lojban!


''Please explain this comment.''
However, there still remain two problems :


Is this really necessary - read above! But in short:
* First, we are still not sure that brkwords.txt describes the reverse of the word formation rules described in the refgram. In other words, it is not sure that is is exactly the reverse of the WordFormationAlgorithm. If it is, all is well (we have an algorithm which given a finite-sized word can classify it into a unique lexeme). If it is not (proving it or not), it won't mean anything, because it is not yet proven impossible to invert the formation algorithm by some other way.


Should a word like "America", standing for a whole continent, be included into Lojban that's claiming to be a *culturally neutral* language, just because the people of those, say, 49  states are accustomed to sing "God bless *America*"??? This would be arrogant, and the expression of a pretty silly chauvinism (''bragu'ebilma'') toward all other countries of America (a well-known attitude of the US during many past decades, e.g. with regard to Cuba, Guatemala etc.). And, as you quoted above: it's "United States '''of''' America", i.e. a - forgiveness! - small part of the whole!
* In the case where brkwords.txt describes the opposite of the formation algorithm (which I doubt, but...), it is very unsatisfying. In particular, running this algorithm requires parsing the text in both directions from any "important" point. More practically, it requires the reader (be it computer or human) to memorize the whole of the text (when it is not cmavo-prefixed) upto the first pause (which may occur at an arbitrary point of time) before starting to parse. Although this is allowable for computers, I find this constraint very cumbersome as a human listener.


BTW, what is {no da gugde po'u...}? "No X being a country" (shouldn't there at least be a '''poi''' inserted? {no da poi gugde ku'o po'u...})
At any rate, I would far prefer to invert the word formation algorithm.


And, if you really wish to name the US by "America", use a '''cmene''' (which really should be written {la .amerikas.}).
* All very well, except that there is no WFA for [[jbocre: Type 4 fu'ivla]]. -- ''Well, we have a general rule to recognize brivla, and morphology rules for every other type of brivla. Therefore a parser which can recognize a brivla will recognize type 4 fu'ivla using its "default" brivla rule.''


''le'osai .i a'enai ko zgana lo'u gugr le'u po'u ba'e zo gugde .i xo da gugde zo'u la'o xy. America .xy pagbu le cmene be dada'o .i ju'ocai pamei .i po'o drata pilno la'o xy. America .xy le braplu po'u la amerikas. .i ja'ocai zo daplrmerika cu sinxa le braplu po'u la amerikas. .i ji'a ja'ocai zo gugrmerika cu satci sinxa pa gugde po'o le do bebna cai xusra be tu'a le bragu'ebilma .ionai cu mabla mu'i ma mi pilno le cmene be le gugrmerika va'o dada'o poi ke'a barda likckupau secu'u so'i le fu'ivla be le'a li 3 be'o fi'o selsni so'i le gugde .i xu do jinvi le du'u le gugrmerika cu nadfadni pa'enai mi'e xod''
It does not mean that a listener has to remember the string all the way up until the first pause. Humans are capable of performing lexical, syntactic and semantic processing of language simultaneously. As the lexical part of your mind tries to split up words, the syntactic part is looking to see if the series of words is valid, while the semantic part is looking to see if it makes sense. You'll discard parses which aren't syntactically or semantically valid. Further, there are parsing algorithms which won't be tripped up by bidirectionality of processing, so a computer isn't going to have a very big problem with it, ''just as long as it isn't ambiguous''. --jay
 
----

Revision as of 17:20, 4 November 2013

See the BPFK's orphology Forum

I think I could not spent one day more without creating a wiki page for this issue.

While Lojban has a very precise, strict, easy-to-parse, thus unambiguous grammar, it has not (emphasized) been proved syntactically unambiguous.

Let me rephrase this statement.

If we consider the Lojban language in a purely formal sense, involving an alphabet composed of its words, supposedly classified into lexemes, then any finite string of lojban (stream of words) can be proven either grammatically correct or not in a finite amount of time using a turing machine.

This is what the Lojban grammar is for : it defines axioms and inference rules between its words (more precisely its lexemes) in a form that can be fed into a LALR(1) parser. And because the LALR(1) parsing algorithm can be implemented using a Turing Machine, and that it can tell, within a finite amount of time, whether a (finite) string of lexemes follow its grammar or not, we have the result stated above.

Now, what does that mean ?

It means that given any stream of Lojban text, if we are able to distinguish its words and categorize them into lexemes, then (roughly) we can parse the text using a computer; however it actually presupposes that we know a way to distinguish the words and categorize them.

And so ?

Two facts :

  • To this day, the way to distinguish the words in a lojban stream of text is very, very poorly documented (from a computer programmer's point of view). The only rules are given in the refgram, are vague, and were never proven unambiguous.
  • The most precise lojban lexer to this day, the morph algorithm of jbofi'e, has not been proved complete nor unambiguous, while already beginning to be hugely complex. IIRC, the more a system is complex, the less it is credible to the human mind.

Epilog

As far as I can understand in the morphology of lojban, a complete lexer simply cannot be written using a LR(k) nor LL(k) grammar, for various reasons related to backtracking and the relation between semantics and morphology.

What does anything have to do with semantics? brkwords.txt will correctly break up djan.polake'ebrIvlaboinoigIsmumu just as well as something grammatical. -- However, we still have a hard time with "su" and "sa", even "si" when it is used to erase a morphologically wrong piece of text. Not to mention that it is possible that the morphology in itself is possible, while altogether being unambiguous when considered together with the grammar (like an ambiguity that cannot happen because the ambiguous lexemes are not allowed in such or such place by the grammar).

The attempt in jbofi'e to solve the problem using NFA's is interesting, but leads to a huge graph of automata, which is very difficult to prove right.

While continuing to search for an elegant algorithmic solution to this issue, I nonetheless believe that real parsing of Lojban using computers (one which involves a real lexer) is a very difficult task unlikely to be done right soon.

--Elrond


One should see the file BRKWORDS.TXT for the official LLG morphology algorithm.

  • Definitely not official!
    • It is what the official LLG parser uses, isn't it?
      • No. As noted elsewhere, the official parser can only segment compound cmavo, and otherwise relies on spaces/dots (which it considers the same).

It seems to work quite well, but it could probably be polished up a little on type 4 fu'ivla.


At one point I put some thought into trying to write a Context Free Grammars ontext free grammar describing the morphology. (Not a context-free one, however. I was intending to use the grammar in conjunction with a generalized parser, like General Parsing omita's or General Parsing arly's). Didn't get anywhere. I think that if an NFA can parse it, there should definitely be a CFG which can describe it, however. ((CFGs being more powerful language-recognition mechanisms than automata.) --Jay

Btw, parsing, even with an NFA is Really Really fast. So even if the process is ugly, then just as long as it works unambiguously, you're all set. -Jay Kominek


Okay. I read thoroughly the brkwords.txt file. The algorithm described there can be coded with a "classic" programming language, which means (according to Church's law) that a Turing machine, or a possibly infinite-state automaton, can morphologically parse (using this algorithm) any lojban text.

However, there still remain two problems :

  • First, we are still not sure that brkwords.txt describes the reverse of the word formation rules described in the refgram. In other words, it is not sure that is is exactly the reverse of the WordFormationAlgorithm. If it is, all is well (we have an algorithm which given a finite-sized word can classify it into a unique lexeme). If it is not (proving it or not), it won't mean anything, because it is not yet proven impossible to invert the formation algorithm by some other way.
  • In the case where brkwords.txt describes the opposite of the formation algorithm (which I doubt, but...), it is very unsatisfying. In particular, running this algorithm requires parsing the text in both directions from any "important" point. More practically, it requires the reader (be it computer or human) to memorize the whole of the text (when it is not cmavo-prefixed) upto the first pause (which may occur at an arbitrary point of time) before starting to parse. Although this is allowable for computers, I find this constraint very cumbersome as a human listener.

At any rate, I would far prefer to invert the word formation algorithm.

  • All very well, except that there is no WFA for jbocre: Type 4 fu'ivla. -- Well, we have a general rule to recognize brivla, and morphology rules for every other type of brivla. Therefore a parser which can recognize a brivla will recognize type 4 fu'ivla using its "default" brivla rule.

It does not mean that a listener has to remember the string all the way up until the first pause. Humans are capable of performing lexical, syntactic and semantic processing of language simultaneously. As the lexical part of your mind tries to split up words, the syntactic part is looking to see if the series of words is valid, while the semantic part is looking to see if it makes sense. You'll discard parses which aren't syntactically or semantically valid. Further, there are parsing algorithms which won't be tripped up by bidirectionality of processing, so a computer isn't going to have a very big problem with it, just as long as it isn't ambiguous. --jay