HTML5 Word Cloud

Posted on March 20, 2011 by Timothy Chien (blog)

This content is over 13 years old. It may be obsolete and may not reflect the current opinion of the author.

Update: Here is the introduction in English.

Can you guess which blog these most frequent words come from?

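Word Cloud is an HTML5 browser demo I have been experimenting with lately. The goal is to reimplement Wordle, which is written in Java, using front-end technologies only. Along the way it is handy for testing browser performance and for analyzing various text sources.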

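The program currently accepts text from a file on your hard drive or from an RSS feed (fetched through the Google Feed API). It runs a word-frequency analysis, sizes each word according to how often it appears, and lays the words out on an HTML5 canvas. Here, for example, is the result for this blog: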

The first image at the top of this post comes from 地圖會說話. Other clouds, such as the one for the New York Times, show what the current news buzzwords are.

Word Frequency Analysis

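For Chinese I use a very standard N-gram approach (up to N=6): slice the text into character groups and count how often each one occurs. For English I use the Porter Stemming Algorithm, commonly used in databases, to normalize word inflections (plurals, progressive forms, and so on). Looking back, Chinese analysis really is a case of "hard to understand, easy to do" (知難行易): the results are good with no filtering at all (though I did later filter out some character groups that are not words), and the only stop word is 「的」.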
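The N-gram counting described above can be sketched roughly as follows. This is a minimal illustration, not the code the demo actually runs: the function name `countNgrams`, the minimum group length of 2, and the decision to break the text at non-Han characters are all assumptions made for this example. Only the N=6 upper bound and the single stop word 「的」 come from the post.

```javascript
// Hypothetical sketch of N-gram counting for Chinese text.
// Assumptions (not from the original post): the function name, the
// minimum length of 2, and treating any non-Han character as a break.
function countNgrams(text, maxN = 6, stopWords = ['的']) {
  const counts = new Map();
  // Break the text into runs of Han characters; punctuation, spaces and
  // Latin letters separate runs, so no N-gram spans them.
  const runs = text.match(/[\u4e00-\u9fff]+/g) || [];
  for (const run of runs) {
    // Stop words also act as separators inside a run.
    let segments = [run];
    for (const stop of stopWords) {
      segments = segments.flatMap((s) => s.split(stop));
    }
    for (const seg of segments) {
      for (let n = 2; n <= maxN; n++) {
        // Slide a window of n characters across the segment.
        for (let i = 0; i + n <= seg.length; i++) {
          const gram = seg.slice(i, i + n);
          counts.set(gram, (counts.get(gram) || 0) + 1);
        }
      }
    }
  }
  return counts;
}

const counts = countNgrams('我可以的，你也可以');
console.log(counts.get('可以')); // 2
```

Counting every window this way also counts fragments that are not real words, which is why some later filtering of non-word groups is still useful.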

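While testing I had doubts about how fast browsers could run these text operations, so I moved that code into Web Workers. It turned out not to improve performance much; the real bottleneck is drawing on the canvas … I kept it that way anyway, because users may feed in large amounts of text (sorry, IE9, which lacks only Web Workers to be able to run this).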

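An extra benefit of doing everything on the front end is that users can load their own files for analysis without uploading them to a server. During testing I asked people for reports they had written at school, and the resulting images were fun. For most blogs the biggest word is 「可以」 ("can"); I wonder whether that is a feature of colloquial Chinese.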

Word Cloud Image

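The image relies on canvas's text-filling function (fillText). A two-dimensional Array records where free space remains; after each word is drawn, getImageData is used to detect where the glyph strokes landed, and the Array is updated. In Chrome, text size is affected by the minimum font size in the browser settings, so the final image contains many small words that are not all that small. Firefox's getImageData also has a performance problem: it runs at one sixth the speed of other browsers, or slower. To keep the browser from freezing I had to add code that measures the time spent and shows a warning when it exceeds a limit. Firefox also has a bug where the canvas is updated but the result is not displayed. It is not very reproducible and seems to behave differently on every computer, so I am leaving it alone for now (both the performance and the problems of canvas are related to hardware acceleration).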
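To make the bookkeeping concrete, here is a rough sketch of a two-dimensional occupancy grid updated from RGBA pixel data of the kind a canvas returns. The cell size, the function names (`createGrid`, `markOccupied`, `canFit`) and the alpha test are assumptions for illustration; the post does not describe the demo's actual data layout. The pixel buffer is built by hand so the example runs without a canvas.

```javascript
// Hypothetical sketch: a 2D occupancy grid updated from RGBA pixel data.
// Names, cell size and the alpha threshold are illustrative assumptions.
function createGrid(cols, rows) {
  // true = cell is still free
  return Array.from({ length: rows }, () => new Array(cols).fill(true));
}

// `pixels` is laid out like ImageData.data: 4 bytes (R, G, B, A) per pixel,
// row by row, covering a width x height area whose top-left is at (0, 0).
function markOccupied(grid, pixels, width, height, cellSize) {
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const alpha = pixels[(y * width + x) * 4 + 3];
      if (alpha > 0) {
        // Any painted pixel makes its whole cell unavailable.
        grid[Math.floor(y / cellSize)][Math.floor(x / cellSize)] = false;
      }
    }
  }
}

// Check whether a w x h block of cells with top-left at (col, row) is free.
function canFit(grid, col, row, w, h) {
  if (row + h > grid.length || col + w > grid[0].length) return false;
  for (let r = row; r < row + h; r++) {
    for (let c = col; c < col + w; c++) {
      if (!grid[r][c]) return false;
    }
  }
  return true;
}

// Simulate an 8x8 pixel canvas with one opaque pixel at (5, 1).
const width = 8, height = 8, cellSize = 4;
const pixels = new Uint8ClampedArray(width * height * 4);
pixels[(1 * width + 5) * 4 + 3] = 255;

const grid = createGrid(width / cellSize, height / cellSize); // 2x2 cells
markOccupied(grid, pixels, width, height, cellSize);
console.log(canFit(grid, 0, 0, 1, 2)); // true: left column is untouched
console.log(canFit(grid, 1, 0, 1, 1)); // false: cell (1, 0) has ink
```

In a browser the buffer would come from something like `ctx.getImageData(0, 0, width, height).data` after each `fillText` call. Working in coarse cells rather than single pixels keeps the grid small, at the cost of slightly looser packing.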

User Interface

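The reason this sat on my hard drive for almost a month is that I could not get the UI right orz … I made two versions, and every time I showed one to someone they easily made mistakes operating it. The current one is version 2.5, whose flow relies on the URL hash and the onhashchange event. Writing a program that is useful but has a failed UI leaves me feeling rather helpless too orz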
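The `#feed:` URL that appears later in the comments suggests a `type:value` layout for the hash. A hash-driven flow along those lines might look roughly like this. `parseHash`, the returned field names and the example.com address are hypothetical, and the real app's routing rules are not shown in the post.

```javascript
// Hypothetical sketch of hash-based routing; not the demo's actual code.
// Assumes the hash has the form "#type:value".
function parseHash(hash) {
  const body = hash.replace(/^#/, '');
  const sep = body.indexOf(':');
  if (sep === -1) {
    // No source given: treat the whole hash as a bare type.
    return { type: body, value: '' };
  }
  // Split only on the first colon so URLs keep their own "://".
  return { type: body.slice(0, sep), value: body.slice(sep + 1) };
}

// In a browser the parser would be wired to the hashchange event;
// the guard keeps this file runnable outside a browser too.
if (typeof window !== 'undefined') {
  window.addEventListener('hashchange', () => {
    const route = parseHash(window.location.hash);
    console.log('navigate to', route.type, route.value);
  });
}

console.log(parseHash('#feed:http://example.com/rss'));
// { type: 'feed', value: 'http://example.com/rss' }
```

Keeping the state in the hash means every result has a shareable URL and the browser's back button steps through the flow.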

What's Next?

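Everyone is welcome to give it a try. Both the word-frequency analysis and the word cloud are packaged as libraries, so reusing them in other applications should be easy. I would like to hold on to the source code for two weeks before releasing it, though; what is on the site now is a minified script.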

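When I have time I would like to support more natural languages (Japanese, Korean, etc.) and use word frequencies for other visuals (or audio? Bob gave me an idea, and only then did I notice that a word-frequency distribution looks quite a lot like the spectrum of a sound after an FFT?)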

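Finally, if you like it, give me a Like XD I write quite a lot of things but rarely know how people react, and I don't know where to promote them.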

57 thoughts on “HTML5 Word Cloud”

Drake says:

March 21, 2011 at 10:46 am

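I tried feeding it http://feeds.feedburner.com/drakeguan and it is really impressive; it feels like being hit by a big lump of text. Do these words come only from the titles, or from the content as well?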
0.0 says:

March 23, 2011 at 3:05 am

Why does the file-loading feature keep failing?
st says:

March 23, 2011 at 7:59 am

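Haha, this thing has already gone viral on my timeline 😆
Speaking of which, the top word on my blog is also 「可以」, how curious
I never realized I use that word so often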
Yunai says:

March 23, 2011 at 9:15 am

This feels really nice XD
But probably too many people are playing with it right now
No matter how I try, I can't get a result
毛毛牙 says:

March 23, 2011 at 10:32 am

So much fun!
AC says:

March 23, 2011 at 3:42 pm

I don't know why mine won't show up :(
My Plurk ID (AIKU1208) keeps triggering an error message 😥
路人小花 says:

March 23, 2011 at 7:20 pm

To AC

It worked for me once I changed the account name from lowercase to uppercase
謬晤 says:

March 24, 2011 at 6:08 pm

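My Plurk never manages to produce a word cloud!
I don't know which step is going wrong.
At first I thought it was because my timeline was private, but after making it public it still can't produce a word cloud……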
Peter says:

March 26, 2011 at 3:21 pm

Maybe because too many people are using it, the “word cloud” image won't display
Still, this is a fun thing
Rabbit says:

March 26, 2011 at 5:27 pm

Why does loading from Plurk keep failing?
murmur says:

March 26, 2011 at 10:54 pm

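Very interesting!

But
1. This program can read hidden blog data, which is scary!

2. Why do some blog accounts keep failing to load even though they are clearly not hidden?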
B says:

March 30, 2011 at 10:10 am

I really like the style of your homepage~ May I borrow it and modify it for my own homepage? Thanks
SiriusLupin says:

March 30, 2011 at 10:43 pm

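Hello, I used my 無名小站 blog account and it keeps saying that loading failed
Is this a problem with 無名 itself, or is there a way to fix it?
I would appreciate your reply~
Thanks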
Pingback: Siu Sir » Blog Archive » Wordle
唯 says:

April 3, 2011 at 11:45 am

I noticed that if the text contains a _, the characters after the _ don't seem to get picked up…?
Pingback: Word Cloud – Open source “Wordle” in HTML5 | Blog: timdream
Maixixi says:

April 16, 2011 at 1:51 am

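It gave me goosebumps……
There is always a huge 「中國」 ("China") in the middle of the image, whether the site is pro-China or not…… Really interesting. GJ!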
Bionta says:

April 22, 2011 at 2:34 pm

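Loading a blog works
But when I tried a Chinese Word file from my computer
everything that came out was in English
Is there a setting I need to pay attention to?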
Pingback: 替代役之食衣住行《食》 – CornGuo's BLOG, of murmurs
sabinechu says:

August 22, 2011 at 3:14 pm

Very interesting! Here's a Like for you
Pingback: 有趣的 HTML5 網站文字雲 « Heresy's Space
路過的 says:

October 9, 2011 at 8:18 pm

Nice
lisa says:

October 13, 2011 at 2:19 am

My Plurk can't be loaded either QQQQQQQQQQQQQQQQQQ
haha says:

October 14, 2011 at 8:10 pm

My Plurk can't be loaded, what should I do
Plurk is the one I wanted to test most TTTTTTTTTTTTTTTTTTTTTT
Din says:

October 15, 2011 at 1:21 am

I also can't load my Plurk no matter what, why is that QAQ
冰音 says:

October 22, 2011 at 11:53 pm

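A suggestion: could you add a "word filter" feature? Right now my cloud is dominated by interjections like 「啊啊啊」 and 「哈哈哈」; the whole image consists of just these two words Orz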
Cybergabi says:

December 7, 2011 at 3:27 pm

Hi Tim, I love your HTML5 Word Cloud script. May I make one suggestion? Make a negative list of words which are not processed in the feed. I am thinking particularly of possessive pronouns, such as my, me, you, yours, they, their, etc., interrogations such as which, what, when, where, why, how; prepositions such as but, instead, of, after, out, over, only etc., auxiliary verbs such as have, had, was, is, am; and adverbs like here and there. That would render more nouns, verbs, and adjectives, and therefore deliver a much better picture of the analyzed text. Good luck – great work!
Timothy Chien says:

December 7, 2011 at 5:15 pm

Actually I have already filtered out some words like “is”, “a”, etc. The array holding this small list can be found in wordfreq.worker.js.

Which words to filter out is kind of subjective. Please pull the code and modify that array to make your own Word Cloud. I could make the list customizable in the UI “someday”.
花咲海夜 says:

December 9, 2011 at 4:11 am

This doesn’t work for me, when I try to use it for facebook. It says to sign in, so I click the button, but ABSOLUTELY NOTHING HAPPENS. This looks like it’d be a really awesome application, if it worked!! Is there some way you can check out and see why the facebook portion isn’t working?
Thanks! ^_^
Local Search SEO says:

December 11, 2011 at 1:08 am

Very interesting and very cool code you whipped up! I ran a couple of scans on my WordPress site, my G+ site and my Google Places Listing. Strong work! Hope you figure out a way to monetize this code, you deserve it. I’d donate if you whipped up a PayPal button. Happy Holidays — Neil
Veign says:

December 16, 2011 at 7:48 am

Can it be modified, easily, to handle phrases?
AmyB says:

January 12, 2012 at 5:13 am

Works on everything else but facebook. I tried everything suggested and it will not work.
Free says:

January 20, 2012 at 8:35 am

Cool, it’s wordle without the hassle! You’re not there yet but it’s a great start. I wish I could have a common wordlist to skim conjunctions of coordinations and articles out of the picture. Congratulations!
Daniel Schwabe says:

January 28, 2012 at 8:58 pm

May I suggest you eliminate the protocol words such as “http” or “ftp” which appear in URLs? For example, in my Facebook word cloud, HTTP is the most frequent word…
In any case, very nice work!
alexander chavez says:

May 5, 2012 at 8:01 pm

you are awesome dude, i just tried a moment ago, and the velocity, simplicity the comfort, its great . congratulations from Lima-Peru
CH says:

May 6, 2012 at 7:01 am

I agree with the others. This is very cool and not limited to RSS feeds like wordle.net. As the others mentioned, if you can filter out http, you, me, etc. that would make it even better.
ATom says:

May 7, 2012 at 6:35 pm

My suggestion: also eliminate words that are too short, like on, at, and, or, etc.
YITING! says:

May 10, 2012 at 11:24 pm

Twitter can't be loaded either!
But this is great!
Timothy Chien says:

May 11, 2012 at 12:38 am

http://timc.idv.tw/wordcloud/zh/#feed:http://twitter.com/statuses/user_timeline/hurt7113.rss

Works for me 🙂
ML says:

July 12, 2012 at 8:11 pm

Can't save
Every time I click save, a blank page pops up and then there's nothing on screen
Pingback: 我也寫了輸入法，而且用 Javascript！ | Blog: timdream
Anonymous says:

September 21, 2012 at 12:24 am

It seems that Wikipedia is tied to the Chinese version.
Could you make it use the English version when in English mode?
Timothy Chien says:

September 21, 2012 at 10:07 am

Fixed.
kk says:

October 25, 2012 at 9:19 pm

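Hello, I can't run the Facebook word cloud

I followed the instruction on the page, “remove HTML5 Word Cloud on the Facebook permissions page and then re-authorize,” but it failed all the same. Could you advise a solution? Thanks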
ceparis says:

April 12, 2013 at 3:24 pm

I ran into this problem too. Does UTF-8 need to be changed to some other encoding?
Timothy Chien says:

April 12, 2013 at 3:36 pm

Please save it as a UTF-8 encoded text file.
Jay Wang says:

April 17, 2013 at 7:17 pm

This is fun to play with; I'm just curious, on Facebook
which content does this program grab? My conversations or my status updates?
ceparis says:

April 22, 2013 at 8:33 pm

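Yes, after converting the Word file to UTF-8 it indeed works fine, haha, thanks!
However, a lot of non-word "words" came out, such as “我们”, “在每”, “每一”, “在这”, “在我”, “我不”, “我看”, “和你”, “有一”. Does the blogger have a solution for this?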
ids93216 says:

June 16, 2013 at 11:15 pm

May I ask… why do I get “Loading failed, please try again later.” for both my Facebook and my website? Thanks
Ryan says:

June 20, 2013 at 1:52 pm

Hello, my FB never produces a word cloud result. Could you take a look? Thanks～
https://www.facebook.com/sagax0802
Ryan says:

June 20, 2013 at 7:33 pm

Sorry, my FB can't produce a result; loading keeps failing
https://www.facebook.com/sagax0802
robler says:

June 21, 2013 at 9:13 am

When grabbing Plurk content it often treats the account name as the most frequent word, which is rather odd Orz
Snow Kao says:

June 22, 2013 at 11:12 am

Maybe because there is too much text on my FB, it never generates successfully, which is a bit of a pity.
Pili Mao says:

June 23, 2013 at 4:57 am

It does indeed have problems in IE, but switching to Google Chrome works
Snow Kao says:

June 23, 2013 at 2:32 pm

It keeps failing to load; I switched computers and browsers and it's still the same. How can this be ><
joelee says:

May 21, 2015 at 2:23 pm

Is there an API I can call?
Or a DLL for reference?
timdream says:

June 1, 2015 at 5:55 pm

This is a client-side web app … there is no server API to call.

Comments are closed.