Archive for the ‘Study’ Category

Installing Emacs on Mac

Monday, October 12th, 2015

These are my own technical notes on how to install and configure Emacs on Mac OS X.

C# Links

Tuesday, December 9th, 2014

Periodic Execution in .NET: This is the best implementation note I have found on the common pattern of running tasks periodically.
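
For my own notes, here is a minimal sketch of one common way to do this in modern .NET (an async loop around Task.Delay); the linked article may well use a different mechanism, such as System.Threading.Timer.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    static class PeriodicRunner
    {
        // Runs `work` once every `interval` until `token` is cancelled.
        // This is a sketch, not the implementation from the linked article.
        public static async Task RunAsync(Func<Task> work, TimeSpan interval, CancellationToken token)
        {
            while (!token.IsCancellationRequested)
            {
                await work();
                try
                {
                    await Task.Delay(interval, token);
                }
                catch (TaskCanceledException)
                {
                    break; // cancelled while waiting for the next tick
                }
            }
        }
    }

One design point worth noting: this schedules the next run relative to the end of the previous one, so long-running work stretches the period instead of overlapping with itself.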

Implementing the Singleton Pattern in C#: I have not actually found the singleton pattern all that useful, but the article is a good discussion of many C# features nonetheless.
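
For reference, the variant I would reach for today is the Lazy<T> one, which I believe is among those the article discusses; treat this as a sketch rather than a copy of the article's code.

    using System;

    // Thread-safe, lazily initialized singleton built on Lazy<T>.
    public sealed class Singleton
    {
        private static readonly Lazy<Singleton> lazy =
            new Lazy<Singleton>(() => new Singleton());

        public static Singleton Instance => lazy.Value;

        // Private constructor prevents outside instantiation.
        private Singleton() { }
    }

Lazy<T> gives thread safety and lazy initialization without any explicit locking, which is most of what the hand-rolled double-checked-locking variants are trying to achieve.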

MADlib Paper Review

Monday, December 1st, 2014

I only recently read the paper The MADlib Analytics Library or MAD Skills, the SQL, almost two years after it was published. In short, I like it a lot. Most mainstream relational databases do not support even the most common machine learning tasks, like linear regression. To run machine learning algorithms on data stored in a database, one has to export the data first and then use statistics or machine learning tools, e.g. R, to get insight from it. This is really cumbersome in my opinion. As machine learning becomes more and more commonplace, most major data processing platforms now include some kind of machine learning library, for example Mahout for Hadoop and MLlib for Spark. How do we address the apparent machine learning tool gap in relational databases? This paper introduces an implementation of a machine learning library inside an RDBMS. It builds on top of an existing RDBMS and requires no changes to database internals such as the storage engine, and no extra SQL syntax. Some observations from a closer look:

  • Some functions, like linear regression, feel very natural and intuitive to use (see the sketch after this list). However, other functions, like LDA, do not seem to fit into a database as naturally.
  • The implementation relies heavily on the array feature. Arrays are well supported in PostgreSQL but not in some other databases like MySQL and SQL Server. “ARRAY” is actually defined in the SQL:2003 standard and can be useful in many other ways I can imagine, so I hope the spread of MADlib or equivalent technologies drives the adoption of arrays in other major database products, or the other way around.
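
As a concrete illustration of why ordinary least squares fits a SQL engine so naturally (my own sketch of the idea, in the spirit of the paper rather than its exact notation): the fitted coefficients

    \hat{\beta} = (X^{\top} X)^{-1} X^{\top} y, \qquad X^{\top} X = \sum_i x_i x_i^{\top}, \qquad X^{\top} y = \sum_i y_i x_i,

depend on the data only through two sums taken row by row, so a user-defined aggregate can accumulate both in a single scan of the table and a small final step solves the resulting linear system; the data never has to leave the database.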

After Reading Steve Jobs

Tuesday, January 10th, 2012

It took me about two weeks to finish Walter Isaacson’s Steve Jobs. The word that left the deepest impression on me: product. Everything serves the product, and Apple’s success is built on one generation of successful products after another.

Algorithm Analysis

Saturday, December 10th, 2011

From my experience, usually only two algorithm analysis techniques are taught in college-level algorithm analysis classes: asymptotic analysis and amortized analysis. There are additional analysis frameworks and tools that can reveal or explain other interesting aspects of certain algorithms and data structures. Here are three interesting examples.

Competitive analysis
I think the best introduction to this line of analysis is still Sleator and Tarjan’s 1985 paper, “Amortized efficiency of list update and paging rules.” It reveals why a simple list update rule can be competitive with an optimally designed algorithm that knows the data it operates on in advance. The analysis is simple yet surprising, and the paper alone is worth reading for pleasure if you don’t already know it.
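
For reference, the definition as I remember it (a paraphrase, not a quote from the paper): an online algorithm A is c-competitive if there is a constant b such that for every request sequence \sigma,

    \mathrm{cost}_A(\sigma) \le c \cdot \mathrm{cost}_{OPT}(\sigma) + b,

where OPT is the optimal offline algorithm that sees all of \sigma in advance. Sleator and Tarjan show, for example, that the simple move-to-front rule is 2-competitive for list update.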

External memory
The RDBMS has been the most popular data management system of the past three decades, and to my knowledge every RDBMS includes some kind of B-tree implementation. External memory analysis shows why algorithms and data structures like the B-tree work so well in a world of layered storage systems. In other words, if your data does not fit into memory, this is the model to use when analyzing and designing algorithms that access data outside memory.
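
A one-line summary of the model, as I understand it: cost is counted in block transfers between a slow storage level read in blocks of B items and a fast memory holding M items, not in CPU operations. A B-tree with fan-out \Theta(B) then performs a search, insert, or delete on N keys in

    O(\log_B N)

block transfers, versus O(\log_2 N) transfers for a binary search tree laid out without regard to blocks, which is exactly why it suits layered storage.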

Smoothed analysis
The simplex method is a popular algorithm for solving linear programming problems. It has exponential worst-case running time according to asymptotic analysis, yet in practice it works very well. Smoothed analysis offers a theoretical explanation of this phenomenon.
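
Roughly, in my paraphrase of Spielman and Teng’s definition, the smoothed complexity of an algorithm A on inputs of size n with perturbation parameter \sigma is

    C_A(n, \sigma) = \max_{x} \, \mathbb{E}_{g} [ T_A(x + \sigma g) ],

the worst case over inputs x of the expected running time after adding Gaussian noise g scaled by \sigma. Spielman and Teng showed that the simplex method with the shadow vertex pivot rule has smoothed complexity polynomial in n and 1/\sigma, even though its worst case is exponential.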

These three examples are interesting to me. After I graduated from college and began working in industry, I sometimes found that the code used in practice was quite different from what I had thought about and written in college. When data structures and algorithms have to be chosen for real problems, algorithms with worse big-O bounds are sometimes used, and simple, straightforward data structures are favored over sophisticated ones that look better on paper. However, when I look deeper and analyze things in the exact context of the problem, those choices often begin to make sense. The three examples above remind me that when considering an algorithm or data structure for a real problem, there are usually more things to take into account than simple big-O notation. I think this is also why we still need simulations and experiments when choosing or designing algorithms and data structures, at least for many real-world problems.

Visiting the 737 Production Line

Monday, September 27th, 2010

Thanks to a classmate, I visited Boeing’s 737 factory today.

This factory houses the final assembly line for the 737. Many of the parts are made by other companies or by other Boeing divisions. During the tour I saw a row of large wooden crates labeled Shanghai Aircraft Industrial Company, though I don’t know what parts were inside. The engines are the most expensive components, accounting for roughly one third of the aircraft’s cost, and they are not made by Boeing: the engines on today’s 737s are built by CFMI, a joint venture between GE (US) and SNECMA (France).

The assembly process itself is not complicated. The engineers’ offices are right inside the factory building, so when the mechanics run into problems during assembly, the engineers can come down to the line at any time to take a look. The engineers’ working environment is fairly ordinary.

My impression is that in aircraft manufacturing, the core and most difficult part is the design, and the design and manufacture of the engines is the most technically demanding part of all.

Graduation Summary

Wednesday, May 19th, 2010

Although the commencement ceremony hasn’t happened yet and I haven’t formally received my degree, as of today it’s fair to say the dust has settled. I took only one course in my final semester, and it earned me my worst grade.

I know I could have done better, just as with the college entrance exam back then. But on the other hand, also like my senior year of high school, I didn’t really try. Maybe my interest simply isn’t there, maybe my mindset has changed, or maybe my ability really isn’t enough. In any case, I have lost my enthusiasm for pure mathematics. I can’t say I gained nothing: even though my grades were disappointing, I feel that I could now pick up any mathematics book and, if I wanted to, read through it. On the other hand, proving simple abstract results seems less and less meaningful to me. I think I will probably give up studying pure mathematics. (By pure mathematics I mean mathematics for its own sake.)

Although the last semester was far from ideal, overall these two years in the US have been very rewarding. I learned the things I wanted to learn (NLP, machine learning) and discovered new interests (databases, theory). I did not carry out my original plan of continuing on to a PhD, but long before graduation I had already found a job that is ideal and matches my interests. Facing the future, I feel a little less lost and a little more confident. Taking it one step at a time, I believe I can find my own road to success.

HFT Seminar

Friday, May 7th, 2010

This morning I attended a seminar on HFT. How popular is HFT in the US? I heard today that more than 60% of trading volume comes from HFT, roughly half from investment banks and market makers and half from independent firms. Time was short, and because trade secrets were involved, many details were left vague. Here are a few of my impressions; there may well be inaccuracies.

  1. The architecture of the trading system itself is simple.
    • The input is trading data. Some firms also use news data, but that is much more complicated.
    • The first processing layer converts the data into their own format.
    • After that, the computation happens in a single system. That system is one integrated whole that includes the trading models, risk control, automatic order placement, and so on. There are no multiple layers; he stressed several times that a multi-layer architecture makes no sense for them.
    • In the HFT domain the models are relatively simple; the time complexity may be only linear.
    • The system’s database is an in-memory database they wrote themselves. (This seems to be common: Bloomberg says their news processing also runs on a home-grown in-memory database.)
    • He later demoed this database: the binary is about 500k and the code is about 180k lines. It is not SQL; it uses their own data processing language, which is said to be much faster. (I am skeptical of that.)
  2. The arbitrage model mainly exploits the speed of computers, as well as the scale of the trading.
  3. They close out all positions every day. The figure he gave was, I think, no more than $30 million per day.
  4. They have to account for network latency. For example, the fiber link from Chicago is about 10 ms slower than New York, and Europe can be 10x that.
  5. When their trading volume is large enough, the exchange or the market maker actually pays them. (My understanding is that this is because they provide liquidity.)
  6. The interfaces and data provided by third parties are very messy (across .NET, Java, C, etc.).
  7. From receiving the data to computing a result, their system is faster than the latency of the data and the trades themselves; the time he gave was, I think, 20 nanoseconds.
  8. However, third-party trading systems often cannot keep up, so their system has to take the other side’s processing capacity into account as well.
  9. The speaker was smarter than I had expected. At one point someone asked whether they use GPUs, and he briefly explained why the GPU computation model does not apply; the analysis was spot on. The database mentioned earlier was also written by one person.
  10. They also write their own Java compiler, along with a lot of in-house technology and tools, including the aforementioned database.
  11. When hiring, they mainly look at whether a person understands how a program actually runs on a computer: the effect of every line of code on memory, cache, and CPU.
  12. Testing, fault tolerance, and so on are also hard problems.

One impression I got during the talk is that the exchanges’ trading systems seem extremely convoluted and chaotic, like a street market, with no guarantee of safety. For systems handling this much money, it all felt a bit slapdash to me. Not long after I came back from the talk, the Dow Jones plunged 1,000 points because of some erroneous trade.

A few more personal remarks: I just looked at the speaker’s LinkedIn profile. He previously worked at Nvidia as a software engineer, which may explain why his understanding of GPU computing was so accurate. Another company he worked for, Kx Systems, is also interesting: its main product is an in-memory database, kdb, with its own intriguing processing language, the k programming language. I have never used it, but from the description, that database sounds very similar to the one he demoed.

Update: 60 Minutes recently ran a story on HFT. Their title was “speed traders,” and winning by speed matches my impression too: it is less high-frequency trading than automated fast trading.

http://www.cbsnews.com/video/watch/?id=7368460n

Complexity Theory Side Note

Saturday, November 21st, 2009

I am currently reading the book Computational Complexity: A Modern Approach. I have been taking side notes while reading, and I will update the notes here since I have also found some other interesting related materials online.

Chapter 1

  1. I don’t find the definition of a time-constructible function satisfying; the gaps to fill in may be the conditions and properties such functions must have. This concept is important for the Time Hierarchy Theorem; Question 3.5 is related. (The definition, as I read it, is restated after this list.)
  2. Here is a blog post by Dick Lipton about the oblivious Turing machine. This definition is later used in proving the Cook-Levin Theorem. Exercises 1.5 and 1.6 help in understanding the machine.
  3. On the philosophical issues, another blog post, by Daniel Lemire, is interesting.
  4. Typos: page 16, line 2, in the definition of time-constructible functions should be: “the function |x| \to \llcorner T(|x|) \lrcorner …”; page 24, line 2: “Note that we …” should be “We note that …”.
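
For my own reference, the definition as I read it (paraphrased from the book): a function T: \mathbb{N} \to \mathbb{N} is time constructible if T(n) \ge n and there is a Turing machine that computes the function x \mapsto \llcorner T(|x|) \lrcorner, the binary representation of T(|x|), in time O(T(n)). What the definition does not convey is which functions actually satisfy it; the common functions one meets in practice, such as n, n \log n, n^2, and 2^n, all do.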

A timeline of some important results from the first several chapters:

  • Space/Time Hierarchy Theorem [SHL65] [HS65]
  • Savitch’s Theorem [Sav70]: implies PSPACE = NPSPACE
  • Cook-Levin Theorem [Coo71] [Lev73]: SAT is NP-complete
  • Nondeterministic Time Hierarchy Theorem [Coo72]
  • Ladner’s Theorem [Lad75]: If NP \neq P, there exists L \in NP \setminus P that is not NP-complete.
  • Immerman-Szelepcsenyi Theorem [Imm88] [Sze87]: \overline{\text{PATH}} \in \text{NL}, which implies NL = coNL.

(to be continued…)

How to Make Yourself Understood

Thursday, October 29th, 2009

DK: I tried to write things up in a way that was jargon-free, so the nonspecialist would understand it. What I succeeded in doing is making it so that the specialist can understand it, but if I hadn’t tried to write jargon-free, then I would have written for specialists, and then the specialists wouldn’t be able to get it either.

Write for the layperson and the specialists can understand it; write for the specialists and no one can understand it. So says Knuth.