可定制字符串过滤器的设计
【腾讯云】亏本大甩卖,服务器4核16G 1年370元(带宽12M,系统盘120GB SSD盘,月流量2000GB)!!!!!!
云产品 配置 价格
服务器 1核2G,带宽5M,系统盘50GB SSD盘,月流量500GB 38元/年
MySQL 1核1G 19元/年
服务器 16核32G,带宽18M,系统盘250GB SSD盘,月流量5000GB 1197元/年
点我进入腾讯云,查看更多详情

Suppose I've tons of filenames in my_dir/my_subdir, formatted in a some way:

data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00
data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00
data11_7TeV.00178109.physics_Egamma.merge.D2AOD_DIPHO.f351_m765_p539_p540_tid312017_00

For example data11_7TeV is the data_type, 00179691 the run number, NTUP_PHOTON the data format.

I want to write an interface to do something like this:

dataset = DataManager("my_dir/my_subdir").filter_type("data11_7TeV").filter_run("> 00179691").filter_tag("m = 796");                     
// don't to the filtering, be lazy
cout << dataset.count();                          // count is an action, do the filtering
vector<string> dataset_list = dataset.get_list(); // don't repeat the filtering
dataset.save_filter("file.txt", "ALIAS");         // save the filter (not the filenames), for example save the regex
dataset2 = DataManagerAlias("file.txt", "ALIAS"); // get the saved filter
cout << dataset2.filter_tag("p = 123").count();

I want lazy behaviour, for example no real filtering has to be done before any action like count or get_list. I don't want to redo the filtering if it is already done. I'm just learning something about design pattern, and I think I can use:

  • an abstract base class AbstractFilter that implement filter* methods
  • factory to decide from the called method which decorator use
  • every time I call a filter* method I return a decorated class, for example:

AbstractFilter::filter_run(string arg) {
    decorator = factory.get_decorator_run(arg);  // if arg is "> 00179691" returns FilterRunGreater(00179691)
    return decorator(this);
}

  • proxy that build a regex to filter the filenames, but don't do the filtering

I'm also learning jQuery and I'm using a similar chaining mechanism.

Can someone give me some hints? Is there some place where a design like this is explained? The design must be very flexible, in particular to handle new format in the filenames.

#0

I believe you're over-complicating the design-pattern aspect and glossing over the underlying matching/indexing issues. Getting the full directory listing from disk can be expected to be orders of magnitude more expensive than the in-RAM filtering of filenames it returns, and the former needs to have completed before you can do a count() or get_list() on any dataset (though you could come up with some lazier iterator operations over the dataset).

As presented, the real functional challenge could be in indexing the filenames so you can repeatedly find the matches quickly. But, even that's unlikely as you presumably proceed from getting the dataset of filenames to actually opening those files, which is again orders of magnitude slower. So, optimisation of the indexing may not make any appreciable impact to your overall program's performance.

But, lets say you read all the matching directory entries into an array A.

Now, for filtering, it seems your requirements can generally be met using std::multimap find(), lower_bound() and upper_bound(). The most general way to approach it is to have separate multimaps for data type, run number, data format, p value, m value, tid etc. that map to a list of indices in A. You can then use existing STL algorithms to find the indices that are common to the results of your individual filters.

There are a lot of optimisations possible if you happen to have unstated insights / restrictions re your data and filtering needs (which is very likely). For example:

  • if you know a particular filter will always be used, and immediately cuts the potential matches down to a manageable number (e.g. < ~100), then you could use it first and resort to brute force searches for subsequent filtering.

Another possibility is to extract properties of individual filenames into a structure: std::string data_type; std::vector<int> p; etc., then write an expression evaluator supporting predicates like "p includes 924 and data_type == 'XYZ'", though by itself that lends itself to brute-force comparisons rather than faster index-based matching.

I know you said you don't want to use external libraries, but an in-memory database and SQL-like query ability may save you a lot of grief if your needs really are at the more elaborate end of the spectrum.

#1

I would use a strategy pattern. Your DataManager is constructing a DataSet type, and the DataSet has a FilteringPolicy assigned. The default can be a NullFilteringPolicy which means no filters. If the DataSet member function filter_type(string t) is called, it swaps out the filter policy class with a new one. The new one can be factory constructed via the filter_type param. Methods like filter_run() can be used to add filtering conditions onto the FilterPolicy. In the NullFilterPolicy case it's just no-ops. This seems straghtforward to me, I hope this helps.

EDIT: To address the method chaining you simply need to return *this; e.g. return a reference to the DataSet class. This means you can chain DataSet methods together. It's what the c++ iostream libraries do when you implement operator>> or operator<<.

#2

First of all, I think that your design is pretty smart and lends itself well to the kind of behavior you are trying to model.

Anyway, my understanding is that you are trying and building a sort of "Domain Specific Language", whereby you can chain "verbs" (the various filtering methods) representing actions on, or connecting "entities" (where the variability is represented by different naming formats that could exist, although you do not say anything about this).

In this respect, a very interesting discussion is found in Martin Flowler's book "Domain Specific Languages". Just to give you a taste of what it is about, here you can find an interesting discussion about the "Method Chaining" pattern, defined as:

“Make modifier methods return the host object so that multiple modifiers can be invoked in a single expression.”

As you can see, this pattern describes the very chaining mechanism you are positing in your design.

Here you have a list of all the patterns that were found interesting in defining such DSLs. Again, you will be easily find there several specialized patterns that you are also implying in your design or describing as way of more generic patterns (like the decorator). A few of them are: Regex Table Lexer, Method Chaining, Expression Builder, etc. And many more that could help you further specify your design.

All in all, I could add my grain of salt by saying that I see a place for a "command processor" pattern in your specificaiton, but I am pretty confident that by deploying the powerful abstractions that Fowler proposes you will be able to come up with a much more specific and precise design, covering aspect of the problem that right now are simply hidden by the "generality" of the GoF pattern set.

It is true that this could be "overkill" for a problem like the one you are describing, but as an exercise in pattern oriented design it can be very insightful.

#3

I'd suggest starting with the boost iterator library - eg the filter iterator.

(And, of course, boost includes a very nice regex library.)

推荐文章

当您忘记关闭和释放游标时会发生什么?

当您忘记关闭和释放游标时会发生什么?

推荐文章

vmware映像windows 7无法初始化winring0 dll?

vmware映像windows 7无法初始化winring0 dll?

推荐文章

Ruby数组问题

Ruby数组问题

推荐文章

使用web服务从位于公共网页上的表单更新SharePoint 2010列表?

使用web服务从位于公共网页上的表单更新SharePoint 2010列表?

推荐文章

将Rails迁移标记为已迁移

将Rails迁移标记为已迁移

推荐文章

禁用touchlistener在一个上下文中工作而不是在另一个上下文中工作

禁用touchlistener在一个上下文中工作而不是在另一个上下文中工作

推荐文章

自定义添加此按钮图标图形

自定义添加此按钮图标图形

推荐文章

我怎样才能让独角兽理解赫罗库的信号?

我怎样才能让独角兽理解赫罗库的信号?

推荐文章

当用户按下一个图像按钮时,如何使其保持按下状态?

当用户按下一个图像按钮时,如何使其保持按下状态?

推荐文章

char[]和sprintf分段错误

char[]和sprintf分段错误

推荐文章

REST服务中的WSo2 esb4.0.3和JSON

REST服务中的WSo2 esb4.0.3和JSON

推荐文章

如何支持4.0以上的手机,不包括平板电脑?

如何支持4.0以上的手机,不包括平板电脑?

推荐文章

java正则表达式查找和替换

java正则表达式查找和替换

推荐文章

带符号数字字符串的Java数字解析挑战

带符号数字字符串的Java数字解析挑战

推荐文章

fullcalendar json源开始时间?

fullcalendar json源开始时间?

推荐文章

mvc4webapi+应用集成解决方案

mvc4webapi+应用集成解决方案