2010-02-10

Compiling the same program on the Mac, there is a real performance gap between the 32-bit binary g++ produces with its default flags and the 64-bit binary produced with -m64. The timings for the former:

real  6m17.111s
user  6m14.810s
sys   0m1.001s

and for the latter:

real  5m31.003s
user  5m26.337s
sys   0m1.259s

That is roughly a 12% improvement (377 s down to 331 s of wall time), and most of the gain seems to be in user-level computation.

2010-02-09

A few days ago I wrote a Monte Carlo stock-screening program in Python and found it far too slow: it is completely CPU-bound, with low memory and I/O usage. So I decided to rewrite it in C++. For convenience I used the STL to translate the Python data structures almost one-to-one, which sped up the computation by less than 3x (Python takes 30 minutes to compute 10,000 points, the C++ version 12 minutes). Going further would mean replacing the STL data structures, which would probably take considerable effort.

The current programs are all single-core; further optimization will probably have to rely on multiple cores. That is easy enough here: just launch several processes, since there is essentially no data dependency between the trials.
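Since the trials are independent, the simplest multi-core approach is a process pool. A minimal sketch in Python (multiprocessing, available since Python 2.6); score() here is a hypothetical stand-in for one trial:

from multiprocessing import Pool
import random

def score(params):
    # Hypothetical stand-in: run one Monte Carlo trial for this
    # parameter set and return its score.
    return sum(random.random() for _ in xrange(1000))

if __name__ == '__main__':
    candidates = range(10000)   # the parameter sets to screen
    pool = Pool()               # one worker process per CPU core by default
    scores = pool.map(score, candidates)
    print max(zip(scores, candidates))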

2010-02-05

Found a bug in the program; positions will have to be readjusted on Monday.

2010-02-04

I am simulating a basic trader, using a Monte Carlo method to screen parameters. Each day the program uses the market value and the moving average to compute the next day's action, and the next day I simply execute yesterday's result at the opening price. This way there is no need to watch the market all day, and it also keeps complicated human emotions out of the decisions. After two days of running it, the results look good.
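The exact rule is not given here; purely as an illustration, a minimal sketch of such a daily decision, with hypothetical parameters window, buy_ratio, and sell_ratio of the kind the Monte Carlo screening would select:

def next_action(price, history, window, buy_ratio, sell_ratio):
    # Hypothetical rule: compare today's price against the moving
    # average and decide tomorrow's order, to be executed at the open.
    ma = sum(history[-window:]) / float(window)
    if price < ma * buy_ratio:
        return 'buy'
    if price > ma * sell_ratio:
        return 'sell'
    return 'hold'

print next_action(9.5, [10.0, 10.2, 9.8, 10.1], 4, 0.97, 1.03)  # -> buy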

However, the market is currently in a recovering, rising phase; it will take at least one full cycle to tell whether the strategy is really good. Also, the current Python program performs too poorly and needs to be rewritten in another language, such as Go, that can make use of multiple cores. I am also considering a reasonably priced multi-core machine; my current MacBook is dual-core, but its compute performance is too weak. Before all that, I need to profile the program to see whether there are optimizations to be had at the level of Python data structures and algorithms.
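For that profiling step, the standard library's cProfile is enough. A minimal sketch, assuming a hypothetical main() entry point for the simulation:

import cProfile

def main():
    # hypothetical entry point of the screening program
    sum(i * i for i in xrange(10 ** 6))

# Sort by cumulative time to see which functions dominate the run.
cProfile.run('main()', sort='cumulative')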

2009-06-06

Usually we run a big suite of tests on a single machine, but it takes a long time and cannot take full advantage of the computing resources within a LAN. Things would be better if the testing work could be split into subtasks that run in parallel or in a distributed fashion.

I've created such a hand-made CLOUD computing environment to run functional tests that take a huge amount of time. It is called a CLOUD in that the job submitter doesn't know where a job is processed.

The architecture is very simple yet effective:
– There are two roles: submitter and worker. Workers are OS processes that do the real jobs; the submitter maps the whole working set (batch) into a sequence of jobs, sends them out, and then reduces the results returned from the workers.
– NFS provides a shared pool/queue that stores all the tasks; each task has four states: new, pending, running, and complete.
– All scripts are written in bash and may call components in other languages (Perl or Python).
– Workers query the pool to take new jobs, run them, and write their output into a log file (a sketch of the claiming loop follows this entry).
– The submitter uses svn st to find the changed working set: it collects all changed files from the current working directory and tars them into the job's folder. The worker that takes the job untars the changed files into its own working directory and then starts the heavy job.
– When no jobs are available, a worker sleeps for some interval and queries again; it can also be notified of a new job via UDP multicast, where that is available.
– It is intended to run within cooperative local area networks, so security is not a concern, but more subtle configuration can be made. The paradigm is general enough that I'll open-source it on Google Code once the business logic is removed.

All the basic development was completed in one day. It is not a robust system like Hadoop and the like, but it can easily be set up by UNIX developers (not sysadmins :)).
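The original scripts are bash, but the job-claiming idea is easy to sketch in Python. A minimal sketch, assuming a hypothetical pool directory in which a task's state is encoded in its file name (task.new, task.running, task.complete):

import os, time

POOL = '/mnt/nfs/pool'  # hypothetical shared NFS directory

def take_job():
    # Claim a job by renaming it; rename within a single directory
    # is atomic, so two workers cannot grab the same job.
    for name in sorted(os.listdir(POOL)):
        if name.endswith('.new'):
            job = os.path.join(POOL, name)
            running = job[:-len('.new')] + '.running'
            try:
                os.rename(job, running)
                return running
            except OSError:
                pass  # another worker claimed it first; keep looking
    return None

while True:
    job = take_job()
    if job is None:
        time.sleep(30)  # no new jobs; poll again later
        continue
    # ... untar the changed files, run the tests, write the log ...
    os.rename(job, job[:-len('.running')] + '.complete')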

2009-02-04

The statistics in the earlier post, "Second-hand housing trends in four Beijing districts", were computed per district, and on inspection the results were not good: except for Tongzhou, the districts have R-squared values of around 0.3, versus 0.6 for Tongzhou. This directly undermines the reliability of the regression equations. The main reasons: districts such as Haidian, Chaoyang, and Changping are large, with a mix of old and new developments and complicated property rights.
A more reliable approach would be statistics based on sub-areas, floor plans, or even individual residential compounds, but such fine-grained partitioning would clearly require far more work than I can do on my own.

So I restricted the statistics to a single sub-area: more than twenty thousand second-hand housing listings (mostly from agents) in the Wangjing area of Chaoyang district. For well-known reasons, prices in Wangjing have been dropping quite sharply lately.
The data was also given some preprocessing, such as trimming outliers at both ends.

Here the one-variable regression model has an R-squared of 0.8166834, which is in the acceptable range.
The regression coefficients are
1.2653403387, -0.0009800008
I won't make a prediction this time, lest the forecast miss and I end up embarrassed.

2009-02-01

A day off at home today, spent looking into second-hand housing. It occurred to me that statistics could be used to study the current trend in second-hand housing prices. The internet has plenty of data; if you know how to mine it, there should be value to be found.

The idea: take historical second-hand housing prices for four Beijing districts (Haidian, Chaoyang, Changping, Tongzhou) from the internet, compute the daily mean price, and run a linear regression on it, aiming to predict future prices.

The data was scraped from Ganji.com's second-hand housing sale pages using Python and stored locally as CSV files. Thanks to Ganji.com: its data is both rich and well structured, which greatly reduced the analysis work and let the whole thing be finished in well under a day. The statistical analysis was done in GNU R: basic data loading, plotting, and a simple linear model.
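The fit itself was done with R's lm(); for reference, an equivalent sketch in Python with numpy, assuming a hypothetical CSV with one (day, price) row per listing, where day counts from the first sampling date:

import csv
import numpy as np

days, prices = [], []
for day, price in csv.reader(open('haidian.csv')):
    days.append(float(day))
    prices.append(float(price))
days, prices = np.array(days), np.array(prices)

# Average the listings per day, then fit price = intercept + slope * day,
# the same model as lm(price ~ day) in R.
uniq = np.unique(days)
daily_mean = np.array([prices[days == d].mean() for d in uniq])
slope, intercept = np.polyfit(uniq, daily_mean, 1)
print intercept, slope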

Enough chatter; on to the data and charts.

[Figure: Haidian district second-hand housing prices]

Haidian district's linear regression coefficients:

1.5440157415, -0.0008758534

These two coefficients form a linear equation whose intuitive meaning is: starting from the first sampling date, 2008-06-20, the mean price on a later date t is 1.544 万元 (万 = 10,000 yuan) - 0.0008758534 * (days from 2008-06-20 to t). So the mean price on 2009-02-01 is about 1.3451 万元 = 1.544 - 0.0008758534 * 227.

[Figure: Chaoyang district second-hand housing prices]

Chaoyang district's linear regression coefficients:

1.3147709213, -0.0005277809

[Figure: Changping district second-hand housing prices]

Changping district's linear regression coefficients:

0.965554124, -0.000528159

[Figure: Tongzhou district second-hand housing prices]

Tongzhou district's linear regression coefficients:

0.7966942687, -0.0006909318
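The predictions below just plug a day count into each fitted line; as a quick check in Python:

def predict(intercept, slope, day):
    # mean price (in 万元) `day` days after 2008-06-20
    return intercept + slope * day

print predict(1.5440157415, -0.0008758534, 400)  # Haidian at day 400: ~1.19
print predict(0.7966942687, -0.0006909318, 580)  # Tongzhou at day 580: ~0.396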

Now let's predict second-hand housing prices half a year out; at that point the day count is 400.
Haidian: 1.19 万元
Chaoyang: 1.10 万元
Changping: 0.754 万元
Tongzhou: 0.52 万元

And a year out, with the day count at 580:
Haidian: 1.03 万元
Chaoyang: 1.00 万元
Changping: 0.659 万元
Tongzhou: 0.396 万元

What we can see is that Haidian is falling fastest (-0.0008758534), while Changping and Chaoyang are falling more slowly. A year from now, prices may drop to a relatively reasonable level, though personally the pace of decline in second-hand prices still feels rather slow, which leaves me not quite satisfied.

Of course, a linear regression is crude, and the factors that drive real housing prices are very complex, so these conclusions can only serve as a reference. If things turn out differently from the predictions here, please don't come after me.

Partitioning by district is still coarse. The system also supports statistics and analysis on finer subdivisions; Haidian, for example, can be split into Mudanyuan, Shangdi, and so on. But that is too much trouble, so I'll stop here.

2008-01-06

Currently there are many broadly used distributed computing techniques, such as RPC, CORBA, SOAP, XMLRPC, and so on. They differ in transports, mechanisms, and even policies, but they share a common trait: some set of functions is exposed for clients to call, and that is the only way peers communicate. API calling is language-neutral but in practice takes the form of procedure calls in modern programming languages. For example, facebook.com provides dozens of APIs for its applications; they are documented in natural language, and you need to read the documents to understand what the APIs mean and how to call them. This becomes a big issue when you are facing hundreds of such services: for developers, most of the code they write is preparing arguments for API calls, and the algorithms and architecture are buried among that code.

Why not have a RESTful architectural view of a system? What we manipulate is then not APIs but resources. A resource has only an identifier (usually a URL) and at most four standard methods: GET to show, UPDATE to change, CREATE to be born, DELETE to die (or maybe five, the fifth being COPY to move). This is a natural view of a system. Imagine a container full of resources, manipulated through the standard GET/UPDATE/CREATE/DELETE methods; such a system should be easier to understand even with little documentation.

With these ideas I have built a RESTful distributed platform in Python, with XML-RPC over Twisted as the transport layer (http://restpy.googlecode.com). A running system has multiple nodes (a node here is an operating system process), and each node contains resources. A resource is addressed through a reference, which we can obtain from its node's address and its identifier.

Below is sample code for manipulating persons; you can get the samples through svn.
First, let's start a node:
% python samples/person.py server localhost:8888

Here localhost:8888 is the address at which the node can be reached.
Now we can visit the node remotely; let's launch the Python console.
>>> import restpy # import the restpy module; you need to install it first, of course :)
>>> ps = restpy.reference('localhost:8888', '/persons/') # get a reference to persons@localhost:8888
>>> # localhost:8888 is the address of the node we just launched; '/persons/' is the identifier of the persons resource at that node.
>>> ps # a reference
'ref-http://localhost:8888/persons/'
>>> ps.get()
[]
>>> # We get an empty list: there is no person in the node yet
>>> p = ps.create(id=567)  # create a person with id=567; p is its reference
>>> ps.get() # now ps has something
['ref-http://localhost:8888/persons/567']
>>> p.get() # p has no name yet
u'[NoName]'
>>> p.update(name='Alice') # set the name of person p
>>> p.get()
u'[Alice]'
>>> p.create(greeting='Hello') # give p a greeting
'Alice: Hello, Thank you!'

That's it; this is a very simple sample. restpy is still in a very raw state and has some bugs; I'll keep developing it to make it more usable, so please give it some support :).

2007-10-09

Python has no switch/case control flow as many other languages do. One way of doing switch/case is if .. elif .. else; another is through a dictionary (a sketch of that style follows the code below). Here I give yet another style, using exceptions. Please treat it as a joke only; don't argue issues such as performance with me.

def case(case_x):
    # Build (and cache in globals) a distinct Exception subclass
    # for each case value.
    case_cls = 'Case_%s' % case_x
    if case_cls not in globals():
        exec('class %s(Exception):pass' % case_cls)
        globals()[case_cls] = eval(case_cls)
    return globals()[case_cls]

def switch(x):
    # "switch" on x by raising its case's exception class
    raise case(x)()

# Usage
xx = 7
try:
    switch(xx)
except case(8):
    print 'this is',
    print ' 8!'
except case(7):
    print 'that is',
    print ' 7!'
except Exception: # default branch
    print 'default goes here'
    print 'default'
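For comparison, the dictionary style mentioned above looks like this (a minimal sketch):

def on_7(): print 'that is 7!'
def on_8(): print 'this is 8!'
def default(): print 'default goes here'

dispatch = {7: on_7, 8: on_8}
xx = 7
dispatch.get(xx, default)()  # prints: that is 7!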

2007-08-05

The dict class has a setdefault method:

a.setdefault(k[, x])      a[k] if k in a, else x (also setting it)

It's a great method; before it existed we had to write

if not a.has_key(k):
    a[k] = x
# do something with a[k]

Now setdefault() combines all of this into a single function call, which keeps the code concise.

But wait: the two code snippets have slightly different semantics. In the second one, x is evaluated only when k is absent, while in the first one, x is evaluated even if key k is already in dictionary a. Note that x can be any expression, even one with side effects, so there is a hole that can implicitly distort the program's behaviour.
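A minimal demonstration of that hole:

def make_default():
    print 'building the default'  # a visible side effect
    return []

a = {1: 'one'}
# The argument is evaluated before setdefault runs, so the side
# effect fires even though key 1 is already present.
a.setdefault(1, make_default())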

Why not initialize x only when it is really needed? Why not give setdefault a lazy style, called like the following:

a.setdefaultlazy(ak, lambda k: av) # av may or may not depend on k
 
I have made this extension by inheriting from UserDict:

import UserDict

class LazyUserDict(UserDict.UserDict):
    def setdefaultlazy(self, key, absent_cb=None):
        # absent_cb is called only when the key is missing, so the
        # default value is computed lazily.
        if key not in self.data:
            self.data[key] = absent_cb(key) if absent_cb else None
        return self.data[key]

Here absent_cb is a callable (any object with a __call__ method, or a function/lambda) that is called with the key as its argument if and only if the key is absent from the dict.

A calling example is simply:

a = LazyUserDict({5: 7})
iv = 78
a.setdefaultlazy(8, lambda k: k * 2 + iv)
print a  # now contains 5: 7 and 8: 94