Letter Frequencies


I've just generated new letter frequency data based on all but the first section of test_sentences.txt. So basically, the CLL, Alice, and a bunch of IRC. If people would like to suggest other non-trivially sized Lojban texts to add, please let me know, but we've got ~650K characters here, so I think the statistics are pretty good.

My data, sorted by number of occurrences:

85004 i

68959 a

52225 e

50517 u

47944 o

43807 l

36358 n

33169 c

27097 m

24514 r

22989 s

21356 d

20536 '

18317 t

17749 k

14459 b

13359 p

11990 j

8810 g

8007 z

6857 v

6616 x

6288 f

4580 y

As ratios:

0.130472888242183 i

0.105845370809523 a

0.080160305261493 e

0.077538691065483 u

0.073589385839292 o

0.067239492438300 l

0.055806000549495 n

0.050911195121464 c

0.041591264560472 m

0.037626610305031 r

0.035285883344307 s

0.032779386867677 d

0.031520766469124 '

0.028114816878406 t

0.027242992016969 k

0.022193161393507 b

0.020504768175936 p

0.018403486071523 j

0.013522494769818 g

0.012289967720991 z

0.010524829357167 v

0.010154917752226 x

0.009651469592805 f

0.007029855396795 y
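For anyone who wants to reproduce this kind of table on another text, here is a minimal Python sketch. It is not the script used to generate the numbers above; the names `LOJBAN_LETTERS`, `letter_counts` and `letter_ratios` are made up for illustration. It assumes that only the 24 symbols listed in the tables are counted (the apostrophe included), that case is ignored, and that each ratio is a letter's count divided by the total of all counted letters, which is consistent with the figures above.

```python
from collections import Counter

# Assumption: count exactly the 24 symbols that appear in the tables above.
# The apostrophe is treated as a letter; periods, commas and spaces are skipped.
LOJBAN_LETTERS = "abcdefgijklmnoprstuvxyz'"


def letter_counts(text):
    """Count occurrences of each Lojban letter in text, ignoring case."""
    return Counter(ch for ch in text.lower() if ch in LOJBAN_LETTERS)


def letter_ratios(counts):
    """Turn raw counts into ratios of the total number of counted letters."""
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


if __name__ == "__main__":
    counts = letter_counts("coi rodo .i mi klama le zarci")
    ratios = letter_ratios(counts)
    # Print in the same layout as the tables: most frequent letter first.
    for letter, n in counts.most_common():
        print(f"{n} {letter}  ({ratios[letter]:.6f})")
```

To run it over a whole corpus, read the file into a string and pass it to `letter_counts`.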

The only previous work on this I'm aware of is:

The Scrabble Paper

Which, it turns out, is amazingly flawed (which is fine, because that was a long time ago!).

Using the data without lujvo, we have:

i 1045

a 991

u 642

n 563

e 496

r 460

o 395

t 361

c 360

l 348

s 339

' 316

k 285

m 254

j 249

d 219

b 212

p 203

f 149

g 146

v 119

x 108

z 87

y 19

which is only marginally different from what I have.

Using the data with lujvo, however, which IIRC is what the Scrabble frequencies were based on, we have the obviously biased:

y 5553

r 2979

a 2949

i 2678

n 2047

u 1755

e 1560

l 1395

s 1363

t 1359

k 1107

m 1048

o 1046

c 1040

' 1012

j 1008

p 872

b 865

d 862

f 616

g 589

x 532

v 490

z 359

-Robin Lee Powell