Archive for the ‘Study’ Category

Installing Emacs on Mac

Monday, October 12th, 2015

This is my own technical notes on how to install and config Emacs on Mac OS X.

C# Links

Tuesday, December 9th, 2014

Periodic Execution in .NET: This is the best implementation note I found of the common pattern to run tasks periodically.

Implementing the Singleton Pattern in C#: I actually found singleton not all that useful. The article is a good discussion of many C# features nonetheless.

MADlib Paper Review

Monday, December 1st, 2014

I only read the paper The MADlib Analytics Library or MAD Skills, the SQL recently after it has published for almost two years. In short, I like it a lot. Most mainstream relational databases do not support even the most common machine learning tasks like linear regression. In order to run machine learning algorithms on the data inside the databases, one has to first export the data, then use statistics or machine learning tools, e.g. R, to get insight from the data. This is really cumbersome in my opinion. As machine learning becoming more and more common place nowadays. Most major database processing platforms include some kind machine libraries. For example, Mahout for Hadoop and MLlib for Spark. How do we address the apparent machine learning tool gap in relational databases? This paper introduces an implementation of machine libraries inside RDBMS. It builds on top of existing RDBMS, and requires no altering of any of the database internals like storage engine or extra SQL syntax. Some of my observations if taking a closer look.

  • Some functions like linear regression seem very natural and intuitive to use. However, some functions like LDA does not seem to fit in databases as naturally.
  • The implementation relies heavily on the array feature. This is a well supported feature in PostgreSQL but not in some other databases like MySQL and SQL Server. “Array” is actually defined in SQL 2003 standard, and can be very useful in other ways I can image. So I hope the widespread of MADlib or equivalent technologies can drive the adoption of arrays in other major database products; or the other way around.

《Steve Jobs》读后

Tuesday, January 10th, 2012

花了差不多两个星期读完了Walter Isaacson的《Steve Jobs》。感受最深的一个词:product。一切都是为产品服务,苹果的成功也是建立在它一代又一代成功的产品上的。

Algorithm Analysis

Saturday, December 10th, 2011

From my experience, usually, only two algorithm analysis techniques are tought in college level algorithm analysis classes: asymptotic analysis and amortized analysis. There are some addtional analysis frameworks or tools that can reveal or explain other interesting aspects of certain algorithms and data structures. Here are three interesting examples.

Competitive analysis
I think the best introduction to this line of analysis is still Sleator and Tarjan’s 1985 paper, “Amortized efficiency of list update and paging rules.” It reveals why a simple list update operation can be as competitive as an optimally designed algorithm that knew the data it operates on in advance. The analysis is simple yet surprising. The paper alone is worth reading for pleasure if you did not know it before.

External memory
RDBMS is the most popular data management system in the past three decases. To my knowledge, every RDBMS includes some kind of B-tree implementation. The external memeory analysis shows why algorithms and data structures like B-tree operate so well in the world of layered storage systems. In other words, if you data does not fit into memory, you may want to use this model to analysis and design your algorithms that need to access data outside memory.

Smoothed analysis
Simplex method is a popular algorithm to solve linear programming problems. It has a exponential time worst case according to asymptotic analysis. However in practice, it works very well. The smoothes analysis offered an theoritical explanation of the phenomena.

The three examples above are interesting for me. After I graduated from college and began to work in the industry, there are times I found the code used in practice was quite different than what I thought and wrote in college. When choices of data structures and algorithms need to be made for problems, there are places algorithms with worse big Os are used, and simple and straight forward data structures are favored over sophisticated ones that may prove better on paper. However, when I look deeper and analysis in the exact context of the problem, many times those choices begin to make sense. The three examples above always reminded me when considering an algorithm and data structure for a real problem, there are usually more things need to be taken into account than simple big O notations. I think this is also the reason we still need to do simulations and experiments when we choose or design algorithms and data structures, at least for many real world problems.


Monday, September 27th, 2010






Wednesday, May 19th, 2010



虽然最后一学期非常不理想,整体来说,在美国这两年收获还是很大的。学了想学的知识(NLP,machine learning),也发现了新的兴趣(database,theory)。虽然没有实现当初继续读博士的计划,但毕业前很久就已经找到非常理想也符合我兴趣的工作。面对未来的方向,迷茫少了一些,信心多了一些。一步一个脚印走下去,我觉得我能走出属于自己的通向成功之路。

HFT Seminar

Friday, May 7th, 2010


  1. 交易系统本身架构很简单。
    • 输入是交易数据。也有公司利用新闻数据,不过会复杂很多。
    • 第一层处理是把数据转换成自己的格式。
    • 处理完之后计算模型在一个系统。这个系统是一个整体。包括了交易模型、风险控制、自动下单等等。没有多层layer。他提了很多次,多层架构对他们没有意义。
    • 在HTF领域,模型相对简单。时间复杂度可能只有线性。
    • 这个系统本身的数据库是一个自己写的in memory database。(这种似乎很常见:Bloomberg说他们的新闻处理也是自己写的in memory database。)
    • 他后来演示了一下这个数据库,binary大概500k左右,代码180k lines左右。不是SQL。用的是自己的数据处理语言。据说快很多。(这个我表示怀疑。)
  2. 套利模式主要是利用计算机的速度。和交易的规模。
  3. 每天清仓。他给的数字好像是每天不超过30million$。
  4. 他们要考虑到网络延迟。比如从chicago过来的光纤比纽约要慢10ms,欧洲可能要x10。
  5. 当他们的交易量足够大的时候,证交所或者造市商还要向他们付钱。(我的理解是因为他们提供了流动性。)
  6. 第三方提供的接口和数据很不干净。(包括.net, java, c, etc.)
  7. 他们的系统从接到数据,到计算出结果比数据和交易的latency还要快。他给的时间好像是20纳秒。
  8. 但是第三方交易系统的处理能力往往不够,所以他们的系统要把对方的处理能力也考虑进去。
  9. 这个演讲的人比我想的要聪明。中间有人提问他们用不用GPU。他大概讲了一下为什么GPU的计算模型不适用。分析很到位。之前的数据库也是一个人写出来的。
  10. 他们还写自己的java compiler,以及很多自己in house的技术和工具,包括前面的数据库。
  11. 他们招人的时候主要看这个人懂不懂一个程序在电脑里是怎么运行的。每一行程序对memory,cache,CPU的影响。
  12. 测试、容错等等也是难题。


再加一些个人意见:刚才看了一下演讲人的Linkedin资料。他之前在Nvidia做过software engineer。这可能解释了为什么他对GPU计算理解的准确性。他之前工作的另一个公司Kx System也很有意思。这个公司的主要产品就是一个in memory database – kdb,而且它的处理语言也挺有意思:k programming language。我虽然没有用过,但是就描述上,这个数据库就和之前他演示的数据库非常像。


Update: 《60 minutes 》最近有一个关于HTF的报道。他们的标题体是speed traders,以快取胜也是我的感觉,与其说是高频交易,不如说是自动快速交易。

Complexity Theory Side Note

Saturday, November 21st, 2009

I am currently reading the book: Computational Complexity: A Modern Approach. I have taken some side notes while reading. I will update the nodes here since I also found some other interesting related materials online.

Chapter 1

  1. I don’t feel the definition of time constructible function satisfiable. The possible gaps to fill in may be the conditions and properties of such functions. This concept is important for time hierarchical theorem. Question 3.5 is related.
  2. Here is a blog post by Dick Lipton about oblivious Turing machine. This definition is later used for proving Cook-Levin Theorem. Exercise 1.5 and 1.6 can help understand the machine.
  3. On the philosophy issue, another blog post by Daniel Lemier is interesting.
  4. Typos: page16 line 2 in definition of time-constructible functions should be: “the function |x| \to \llcorner T(|x|) \lrcorner …”; page 24 line 2: “Note that we …” should be “We note that …”.

A time line of some important results in first several chapters:

  • Space/Time Hierarchy Theorem [SHL65] [HS65]
  • Savitch’s Theorem [Sav70] : implies PSPACE = NPSPACE
  • Cook-Levin Theorem [Coo71] [Lev73]: SAT is NP-complete
  • Nondeterministic Time Hierarchy Theorem [Coo72]
  • Ladner’s Theorem [Lad75]: If NP \neq P, exist L \in NP \setminus P that is not NP-complete.
  • Immerman-Szelepcsenyi Theorem [Imm88] [Sze87]: \overline{\text{PATH}} \in \text{NL}, which implies NL = coNL.

(to be continued…)

How to Make Youself Understood

Thursday, October 29th, 2009

DK: I tried to write things up in a way that was jargon-free, so the nonspecialist would understand it. What I succeeded in doing is making it so that the specialist can understand it, but if I hadn’t tried to write jargon-free, then I would have written for specialists, and then the specialists wouldn’t be able to get it either.