Tag archive: java

Calling Java from Python

Question: Calling Java from Python

What is the best way to call java from python? (jython and RPC are not an option for me).

I’ve heard of JCC (http://pypi.python.org/pypi/JCC/1.9), a C++ code generator for calling Java from C++/Python. But this requires compiling every possible call; I would prefer another solution.

I’ve heard about JPype: http://jpype.sourceforge.net/ (tutorial: http://www.slideshare.net/onyame/mixing-python-and-java)

import jpype
jvm_path = "path/to/jvm.dll"  # placeholder: replace with the real path to your JVM library
jpype.startJVM(jvm_path, "-ea")
javaPackage = jpype.JPackage("JavaPackageName") 
javaClass = javaPackage.JavaClassName 
javaObject = javaClass() 
javaObject.JavaMethodName() 
jpype.shutdownJVM() 

This looks like what I need. However, the last release is from Jan 2009 and I see people failing to compile JPype.

Is JPype a dead project?

Are there any other alternatives?

Regards, David


Answer 0

Here is my summary of this problem: 5 Ways of Calling Java from Python

http://baojie.org/blog/2014/06/16/call-java-from-python/

Short answer: JPype works pretty well and is proven in many projects (such as python-boilerpipe), but Pyjnius is faster and simpler than JPype.

I have tried Pyjnius/Jnius, JCC, javabridge, Jpype and Py4j.

Py4j is a bit hard to use, as you need to start a gateway, adding another layer of fragility.


Answer 1

You could also use Py4J. There is an example on the frontpage and lots of documentation, but essentially, you just call Java methods from your python code as if they were python methods:

from py4j.java_gateway import JavaGateway
gateway = JavaGateway()                        # connect to the JVM
java_object = gateway.jvm.mypackage.MyClass()  # invoke constructor
other_object = java_object.doThat()
other_object.doThis(1,'abc')
gateway.jvm.java.lang.System.out.println('Hello World!') # call a static method

As opposed to Jython, one part of Py4J runs in the Python VM so it is always “up to date” with the latest version of Python and you can use libraries that do not run well on Jython (e.g., lxml). The other part runs in the Java VM you want to call.

The communication is done through sockets instead of JNI and Py4J has its own protocol (to optimize certain cases, to manage memory, etc.)

Disclaimer: I am the author of Py4J


Answer 2

Pyjnius.

Docs: http://pyjnius.readthedocs.org/en/latest/

Github: https://github.com/kivy/pyjnius

From the github page:

A Python module to access Java classes as Python classes using JNI.

PyJNIus is a “Work In Progress”.

Quick overview

>>> from jnius import autoclass
>>> autoclass('java.lang.System').out.println('Hello world')
Hello world

>>> Stack = autoclass('java.util.Stack')
>>> stack = Stack()
>>> stack.push('hello')
>>> stack.push('world')
>>> print(stack.pop())
world
>>> print(stack.pop())
hello

Answer 3

I’m on OSX 10.10.2, and succeeded in using JPype.

I ran into installation problems with Jnius (others have too). Javabridge installed but gave mysterious errors when I tried to use it. Py4J has the inconvenience of having to start a Gateway server in Java first, and JCC wouldn’t install. Finally, JPype ended up working. There’s a maintained fork of JPype on GitHub. It has the major advantages that (a) it installs properly and (b) it can very efficiently convert Java arrays to numpy arrays (np_arr = java_arr[:]).

The installation process was:

git clone https://github.com/originell/jpype.git
cd jpype
python setup.py install

And you should be able to import jpype

The following demo worked:

import jpype as jp
jp.startJVM(jp.getDefaultJVMPath(), "-ea")
jp.java.lang.System.out.println("hello world")
jp.shutdownJVM() 

When I tried calling my own java code, I had to first compile (javac ./blah/HelloWorldJPype.java), and I had to change the JVM path from the default (otherwise you’ll get inexplicable “class not found” errors). For me, this meant changing the startJVM command to:

jp.startJVM('/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/MacOS/libjli.dylib', "-ea")
c = jp.JClass('blah.HelloWorldJPype')  
# Where my java class file is in ./blah/HelloWorldJPype.class
...

Answer 4

If you’re in Python 3, there’s a fork of JPype called JPype1-py3

pip install JPype1-py3

This works for me on OSX / Python 3.4.3. (You may need to export JAVA_HOME=/Library/Java/JavaVirtualMachines/your-java-version)

from jpype import *
startJVM(getDefaultJVMPath(), "-ea")
java.lang.System.out.println("hello world")
shutdownJVM()

Answer 5

I’ve been integrating a lot of stuff into Python lately, including Java. The most robust method I’ve found is to use IKVM and a C# wrapper.

IKVM has a neat little application that allows you to take any Java JAR, and convert it directly to .Net DLL. It simply translates the JVM bytecode to CLR bytecode. See http://sourceforge.net/p/ikvm/wiki/Ikvmc/ for details.

The converted library behaves just like a native C# library, and you can use it without needing the JVM. You can then create a C# DLL wrapper project, and add a reference to the converted DLL.

You can now create some wrapper stubs that call the methods that you want to expose, and mark those methods as DllExport. See https://stackoverflow.com/a/29854281/1977538 for details.

The wrapper DLL acts just like a native C library, with the exported methods looking just like exported C functions. You can connect to them using ctypes as usual.

I’ve tried it with Python 2.7, but it should work with 3.0 as well. It works on Windows and on Linux distributions.

If you happen to use C#, then this is probably the best approach to try when integrating almost anything into python.


Answer 6

I’m just beginning to use JPype 0.5.4.2 (July 2011) and it looks like it’s working nicely…
I’m on Xubuntu 10.04


Answer 7

I’m assuming that if you can get from C++ to Java then you are all set. I’ve seen a product of the kind you mention work well. As it happens, the one we used was CodeMesh. I’m not specifically endorsing this vendor, or making any statement about their product’s relative quality, but I have seen it work in quite a high-volume scenario.

Generally, I would recommend keeping away from direct integration via JNI if at all possible. A simple REST service approach or a queue-based architecture will tend to be simpler to develop and diagnose. You can get quite decent performance if you use such decoupled technologies carefully.


Answer 8

In my own experience of trying to run some Java code from within Python, in a manner similar to how Python code can run within Java code, I was unable to find a straightforward method.

My solution was to run the Java code as BeanShell scripts: I edit the Java code in a temporary file, adding the appropriate packages and variables, and then call the BeanShell interpreter as a shell command from my Python code.

If this is helpful in any way, I am glad to share more details of my solution.
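The temporary-file-plus-shell-command approach described above might be sketched as follows. This is only an illustration, not the answerer's actual code: the helper name run_script is made up here, and the BeanShell command line in the comment (including the jar location) is an assumption that would need adjusting to a real installation. The demo at the bottom uses the running Python interpreter as a stand-in so that the sketch can be tried without Java.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(interpreter_cmd, source, suffix=".bsh"):
    """Write `source` to a temporary file, run it, and return its stdout."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        script_path = Path(tmp_dir) / f"snippet{suffix}"
        script_path.write_text(source, encoding="utf-8")
        completed = subprocess.run(
            [*interpreter_cmd, str(script_path)],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    return completed.stdout


# Hypothetical BeanShell invocation; the jar location is an assumption:
#   run_script(["java", "-cp", "bsh.jar", "bsh.Interpreter"], 'print("hi");')

if __name__ == "__main__":
    # Stand-in demo that uses the running Python interpreter instead of Java.
    print(run_script([sys.executable], "print('hello from a child process')", ".py"))
```

Every call starts a new process, so this suits occasional calls better than tight loops.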


Calling Python in Java?

Question: Calling Python in Java?

I am wondering if it is possible to call python functions from java code using jython, or is it only for calling java code from python?


Answer 0

Jython: Python for the Java Platform – http://www.jython.org/index.html

You can easily call python functions from Java code with Jython. That is as long as your python code itself runs under jython, i.e. doesn’t use some c-extensions that aren’t supported.

If that works for you, it’s certainly the simplest solution you can get. Otherwise you can use org.python.util.PythonInterpreter from the new Java6 interpreter support.

A simple example from the top of my head – but should work I hope: (no error checking done for brevity)

import org.python.core.PyObject;
import org.python.core.PyString;
import org.python.util.PythonInterpreter;

PythonInterpreter interpreter = new PythonInterpreter();
// import the function by name so that interpreter.get("funcName") can find it
interpreter.exec("import sys\nsys.path.append('pathToModules if they are not there by default')\nfrom yourModule import funcName");
// execute a function that takes a string and returns a string
PyObject someFunc = interpreter.get("funcName");
PyObject result = someFunc.__call__(new PyString("Test!"));
String realResult = (String) result.__tojava__(String.class);

Answer 1

Hey, I thought I would enter my answer to this even though it’s late. I think there are some important things to consider first about how strong you want the link between Java and Python to be.

Firstly, do you only want to call functions, or do you actually want Python code to change the data in your Java objects? This is very important. If you only want to call some Python code with or without arguments, then that is not very difficult. If your arguments are primitives, it is even easier. However, if you want a Java class to implement member functions in Python that change the data of the Java object, then this is not so easy or straightforward.

Secondly, are we talking CPython, or will Jython do? I would say CPython is where it’s at! I would argue this is why Python is so kool: you get such high-level abstractions, yet access to C and C++ when needed. Imagine if you could have that in Java. This question is not even worth asking if Jython is OK, because then it is easy anyway.

So I have played with the following methods, and listed them from easy to difficult:

Java to Jython

Advantages: Trivially easy. Have actual references to java objects

Disadvantages: No CPython, Extremely Slow!

Jython from Java is so easy, and if this is really enough then great. However, it is very slow and there is no CPython! Is life worth living without CPython? I don’t think so! You can easily have Python code implementing the member functions of your Java objects.

Java to Jython to CPython via Pyro

Pyro is the remote object module for python. You have some object on a cpython interpreter, and you can send it objects which are transferred via serialization and it can also return objects via this method. Note that if you send a serialized python object from jython and then call some functions which change the data in its members, then you will not see those changes in java. You just need to remember to send back the data which you want from pyro. This I believe is the easiest way to get to cpython! You do not need any jni or jna or swig or …. You don’t need to know any c, or c++. kool huh?

Advantages: Access to cpython, not as difficult as following methods

Disadvantages: Cannot change the member data of java objects directly from python. Is somewhat indirect, (jython is middle man).

Java to C/C++ via JNI/JNA/SWIG to Python via Embedded interpreter (maybe using BOOST Libraries?)

OMG, this method is not for the faint of heart. I can tell you it has taken me a very long time to achieve this with a decent method. The main reason you would want to do this is so that you can run CPython code which has full rein over your Java objects. There are major things to consider before deciding to try to breed Java (which is like a chimp) with Python (which is like a horse). Firstly, if you crash the interpreter, that’s lights out for your program! And don’t get me started on concurrency issues! In addition, there is a lot of boilerplate. I believe I have found the best configuration to minimize it, but it is still a lot!

So how to go about this: consider that C++ is your middle man, so your objects are actually C++ objects! Good that you know that now. Just write your object as if your program were in C++, not Java, with the data you want to access from both worlds. Then you can use the wrapper generator called SWIG (http://www.swig.org/Doc1.3/Java.html) to make this accessible to Java and compile a DLL, which you load with System.load(dll name here) in Java. Get this working first, then move on to the hard part!

To get to Python you need to embed an interpreter. Firstly, I suggest doing some hello-interpreter programs or the tutorial “Embedding Python in C/C++”. Once you have that working, it’s time to make the horse and the monkey dance! You can send your C++ object to Python via Boost. I know I have not given you the fish, merely told you where to find the fish. Some pointers to note when compiling:

When you compile Boost you will need to compile a shared library. You need to include and link to the stuff you need from the JDK, i.e. jawt.lib and jvm.lib (you will also need the client jvm.dll in your path when launching the application), as well as python27.lib (or whichever version you use) and boost_python-vc100-mt-1_55.lib. Then include Python/include, jdk/include and boost, and only use shared libraries (DLLs), otherwise Boost has a teary. And yeah, full on, I know. There are so many ways in which this can go sour. So make sure you get each thing done block by block. Then put them together.


Answer 2

It’s not smart to have Python code inside Java. Wrap your Python code with Flask or another web framework to turn it into a microservice, and have your Java program call this microservice (e.g. via REST).

Believe me, this is much simpler and will save you tons of issues. The two codebases are also loosely coupled, so they are scalable.

Updated on Mar 24th 2020: According to @stx’s comment, the above approach is not suitable for massive data transfer between client and server. Here is another approach I recommend: connecting Python and Java with Rust (C/C++ is also OK). https://medium.com/@shmulikamar/https-medium-com-shmulikamar-connecting-python-and-java-with-rust-11c256a1dfb0
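To make the microservice idea concrete, here is a minimal sketch of what the Python side could look like. It deliberately uses only the standard library's http.server instead of Flask so that it runs without extra packages; the shout function, the /shout path, and the JSON field names are all made up for illustration. The call_service helper stands in for the Java program, which would send the same POST request with its own HTTP client.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def shout(text):
    """Stand-in for the Python logic the Java program wants to reuse."""
    return text.upper() + "!"


class ShoutHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and call the wrapped function.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"result": shout(payload.get("text", ""))}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def start_service(port=0):
    """Start the service on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), ShoutHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_service(port, text):
    """Plays the role of the Java client: POST JSON, read the JSON reply."""
    connection = HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request(
            "POST",
            "/shout",
            body=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(connection.getresponse().read())["result"]
    finally:
        connection.close()


if __name__ == "__main__":
    server = start_service()
    print(call_service(server.server_address[1], "hello"))
    server.shutdown()
```

As the update in the answer notes, every call pays for serialization and a network round trip, so this fits coarse-grained calls better than bulk data transfer.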


Answer 3

Several of the answers mention that you can use JNI or JNA to access cpython but I would not recommend starting from scratch because there are already open source libraries for accessing cpython from java. For example:


Answer 4

Here is a library that lets you write your Python scripts once and decide at runtime which integration method to use (Jython, or CPython/PyPy via Jep and Py4j):

https://github.com/subes/invesdwin-context-python

Each method has its own benefits and drawbacks, as explained in the link.


Answer 5

It depends on what you mean by Python functions. If they were written for CPython, you cannot call them directly; you will have to use JNI. But if they were written for Jython, you can easily call them from Java, as Jython ultimately generates Java bytecode.

Now, saying “written in CPython or Jython” doesn’t make much sense, because Python is Python and most code will run on both implementations unless you are using specific libraries which rely on CPython or Java.

See here how to use a Python interpreter in Java.


Answer 6

Depending on your requirements, options like XML-RPC could be useful; it can be used to call functions remotely in virtually any language that supports the protocol.
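As a rough illustration of the XML-RPC option, the sketch below exposes one Python function through the standard library's xmlrpc.server. The function word_count and the helper start_rpc_server are made up for this example. The Python ServerProxy client at the bottom only stands in for the Java caller, which would use an XML-RPC client library of its own to make the same call.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def word_count(text):
    """Hypothetical function exposed to remote callers."""
    return len(text.split())


def start_rpc_server(port=0):
    """Serve word_count over XML-RPC on a background thread (port 0 = any free port)."""
    server = SimpleXMLRPCServer(("127.0.0.1", port), logRequests=False)
    server.register_function(word_count, "word_count")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_rpc_server()
    port = server.server_address[1]
    # A Java client would issue the same "word_count" call against this URL.
    with ServerProxy(f"http://127.0.0.1:{port}/") as proxy:
        print(proxy.word_count("call me from any language"))
    server.shutdown()
```

XML-RPC only carries simple types (numbers, strings, lists, structs), so arguments and results have to be expressed in those terms.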


Answer 7

GraalVM is a good choice. I’ve done Java+Javascript combination with GraalVM for microservice design (Java with Javascript reflection). They recently added support for python, I’d give it a try especially with how big its community has grown over the years.


Answer 8

You can call any language from java using Java Native Interface


Answer 9

Jython has some limitations:

There are a number of differences. First, Jython programs cannot use CPython extension modules written in C. These modules usually have files with the extension .so, .pyd or .dll. If you want to use such a module, you should look for an equivalent written in pure Python or Java. Although it is technically feasible to support such extensions – IronPython does so – there are no plans to do so in Jython.

Distributing my Python scripts as JAR files with Jython?

You can simply call Python scripts (or bash or Perl scripts) from Java using Runtime or ProcessBuilder and pass the output back to Java:

Running a bash shell script in java

Running Command Line in Java

java runtime.getruntime() getting output from executing a command line program


Answer 10

This gives a pretty good overview of the current options, some of which are named in other answers. Jython is not usable until they decide to implement Python 3.x, and many of the other projects come from the Python side and want to access Java. But there are still a few options; to name one that has not been mentioned yet: gRPC


Java or Python for natural language processing

Question: Java or Python for natural language processing

I would like to know which programming language is better for natural language processing: Java or Python? I have found lots of questions and answers about it, but I am still lost in choosing which one to use.

And I want to know which NLP library to use for Java, since there are lots of libraries (LingPipe, GATE, OpenNLP, StanfordNLP). For Python, most programmers recommend NLTK.

But if I am to do some text processing or information extraction from unstructured data (just free formed plain English text) to get some useful information, what is the best option? Java or Python? Suitable library?

Updated

What I want to do is to extract useful product information from unstructured data (E.g. users make different forms of advertisement about mobiles or laptops with not very standard English language)


Answer 0

Java vs Python for NLP is very much a preference or necessity. Depending on the company/projects you’ll need to use one or the other and often there isn’t much of a choice unless you’re heading a project.

Other than NLTK (www.nltk.org), there are actually other libraries for text processing in python:

(for more, see https://pypi.python.org/pypi?%3Aaction=search&term=natural+language+processing&submit=search)

For Java, there’re tonnes of others but here’s another list:

This is a nice comparison for basic string processing, see http://nltk.googlecode.com/svn/trunk/doc/howto/nlp-python.html

A useful comparison of GATE vs UIMA vs OpenNLP, see https://www.assembla.com/spaces/extraction-of-cost-data/wiki/Gate-vs-UIMA-vs-OpenNLP?version=4

If you’re uncertain which language to choose for NLP, personally I say “any language that will give you the desired analysis/output”; see Which language or tools to learn for natural language processing?

Here’s a pretty recent (2017) list of NLP tools: https://github.com/alvations/awesome-community-curated-nlp

An older list of NLP tools (2013): http://web.archive.org/web/20130703190201/http://yauhenklimovich.wordpress.com/2013/05/20/tools-nlp


Other than language processing tools, you would very much need machine learning tools to incorporate into NLP pipelines.

There’s a whole range in Python and Java, and once again it’s up to preference and whether the libraries are user-friendly enough:

Machine Learning libraries in python:

(for more, see https://pypi.python.org/pypi?%3Aaction=search&term=machine+learning&submit=search)


With the recent (2015) deep learning tsunami in NLP, possibly you could consider: https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software

I’ll avoid listing deep learning tools out of non-favoritism / neutrality.


Other Stackoverflow questions that also asked for NLP/ML tools:


Answer 1

The question is very open ended. That said, rather than choose one, below is a comparison depending on the language that you would like to use (since there are good libraries available in both languages).

Python

In terms of Python, the first place you should look at is the Python Natural Language Toolkit. As they note in their description, NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

There is also some excellent code that you can look up that originated out of Google’s Natural Language Toolkit project that is Python based. You can find a link to that code here on GitHub.

Java

The first place to look would be Stanford’s Natural Language Processing Group. All of the software distributed there is written in Java. All recent distributions require Oracle Java 6+ or OpenJDK 7+. Distribution packages include components for command-line invocation, jar files, a Java API, and source code.

Another great option that you see in a lot of machine learning environments here (general option), is Weka. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.


Does Python have an equivalent to Java Class.forName()?

Question: Does Python have an equivalent to Java Class.forName()?

I have the need to take a string argument and create an object of the class named in that string in Python. In Java, I would use Class.forName().newInstance(). Is there an equivalent in Python?


Thanks for the responses. To answer those who want to know what I’m doing: I want to use a command line argument as the class name, and instantiate it. I’m actually programming in Jython and instantiating Java classes, hence the Java-ness of the question. getattr() works great. Thanks much.


Answer 0

Reflection in python is a lot easier and far more flexible than it is in Java.

I recommend reading this tutorial

There’s no direct function (that I know of) which takes a fully qualified class name and returns the class, however you have all the pieces needed to build that, and you can connect them together.

One bit of advice though: don’t try to program in Java style when you’re in python.

If you can explain what it is that you’re trying to do, maybe we can help you find a more Pythonic way of doing it.

Here’s a function that does what you want:

def get_class( kls ):
    parts = kls.split('.')
    module = ".".join(parts[:-1])
    m = __import__( module )
    for comp in parts[1:]:
        m = getattr(m, comp)            
    return m

You can use the return value of this function as if it were the class itself.

Here’s a usage example:

>>> D = get_class("datetime.datetime")
>>> D
<type 'datetime.datetime'>
>>> D.now()
datetime.datetime(2009, 1, 17, 2, 15, 58, 883000)
>>> a = D( 2010, 4, 22 )
>>> a
datetime.datetime(2010, 4, 22, 0, 0)
>>> 

How does that work?

We’re using __import__ to import the module that holds the class, which requires that we first extract the module name from the fully qualified name. Then we import the module:

m = __import__( module )

In this case, m will only refer to the top-level module.

For example, if your class lives in the foo.baz module, then m will be the module foo. We can easily obtain a reference to foo.baz using getattr( m, 'baz' ).
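
To see this behavior concretely, here is a small sketch. It uses the standard-library package xml.dom.minidom purely as a stand-in for foo.baz; that choice is mine and not part of the original answer.

```python
import importlib

# __import__ with a dotted name imports the whole chain,
# but hands back only the top-level package.
top = __import__("xml.dom.minidom")
print(top.__name__)

# Walking down with getattr reaches the submodule.
leaf = getattr(getattr(top, "dom"), "minidom")
print(leaf.__name__)

# importlib.import_module returns the leaf module directly.
same_leaf = importlib.import_module("xml.dom.minidom")
print(leaf is same_leaf)
```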

To get from the top-level module to the class, we have to call getattr repeatedly, once for each remaining part of the class name.

Say, for example, your class name is foo.baz.bar.Model; then we do this:

m = __import__( "foo.baz.bar" ) #m is package foo
m = getattr( m, "baz" ) #m is package baz
m = getattr( m, "bar" ) #m is module bar
m = getattr( m, "Model" ) #m is class Model

This is what’s happening in this loop:

for comp in parts[1:]:
    m = getattr(m, comp)    

At the end of the loop, m will be a reference to the class. This means that m is actually the class itself, so you can do, for instance:

a = m() #instantiate a new instance of the class    
b = m( arg1, arg2 ) # pass arguments to the constructor

Answer 1

Assuming the class is in your scope:

globals()['classname'](args, to, constructor)

Otherwise:

getattr(someModule, 'classname')(args, to, constructor)

Edit: Note, you can’t give a name like ‘foo.bar’ to getattr. You’ll need to split it by . and call getattr() on each piece left-to-right. This will handle that:

from functools import reduce  # reduce is no longer a builtin in Python 3

module, rest = 'foo.bar.baz'.split('.', 1)
fooBar = reduce(lambda a, b: getattr(a, b), rest.split('.'), globals()[module])
someVar = fooBar(args, to, constructor)
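
A minimal runnable sketch of the same left-to-right getattr walk follows. The os module and the name "path.join" are only stand-ins for a real module and dotted name, and resolve_attr is a hypothetical helper name.

```python
import os
from functools import reduce

def resolve_attr(root, dotted_name):
    """Follow each dot-separated piece of dotted_name from root, left to right."""
    return reduce(getattr, dotted_name.split("."), root)

join = resolve_attr(os, "path.join")
print(join is os.path.join)
```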

Answer 2

def import_class_from_string(path):
    from importlib import import_module
    module_path, _, class_name = path.rpartition('.')
    mod = import_module(module_path)
    klass = getattr(mod, class_name)
    return klass

Usage

In [59]: raise import_class_from_string('google.appengine.runtime.apiproxy_errors.DeadlineExceededError')()
---------------------------------------------------------------------------
DeadlineExceededError                     Traceback (most recent call last)
<ipython-input-59-b4e59d809b2f> in <module>()
----> 1 raise import_class_from_string('google.appengine.runtime.apiproxy_errors.DeadlineExceededError')()

DeadlineExceededError: 

Answer 3

Yet another implementation.

def import_class(class_string):
    """Returns class object specified by a string.

    Args:
        class_string: The string representing a class.

    Raises:
        ValueError if module part of the class is not specified.
    """
    module_name, _, class_name = class_string.rpartition('.')
    if module_name == '':
        raise ValueError('Class name must contain module part.')
    # level=0 requests an absolute import; Python 3 rejects the old default of -1.
    return getattr(
        __import__(module_name, globals(), locals(), [class_name], 0),
        class_name)

Answer 4

It seems you’re approaching this from the middle instead of the beginning. What are you really trying to do? Finding the class associated with a given string is a means to an end.

If you clarify your problem, which might require your own mental refactoring, a better solution may present itself.

For instance: Are you trying to load a saved object based on its type name and a set of parameters? Python spells this unpickling and you should look at the pickle module. And even though the unpickling process does exactly what you describe, you don’t have to worry about how it works internally:

>>> class A(object):
...   def __init__(self, v):
...     self.v = v
...   def __reduce__(self):
...     return (self.__class__, (self.v,))
>>> a = A("example")
>>> import pickle
>>> b = pickle.loads(pickle.dumps(a))
>>> a.v, b.v
('example', 'example')
>>> a is b
False
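
One way to see that a pickle records the class by name is to look at the payload of the text-based protocol 0, where the module and class names are visible. This is only an illustrative sketch; datetime.date is used as a convenient standard-library class.

```python
import datetime
import pickle

original = datetime.date(2020, 1, 1)
payload = pickle.dumps(original, protocol=0)

# The payload names the module and class; loads() looks them up by name.
print(b"datetime" in payload)

restored = pickle.loads(payload)
print(restored == original, restored is original)
```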

Answer 5

This is found in the Python standard library, as unittest.TestLoader.loadTestsFromName. Unfortunately the method goes on to do additional test-related activities, but the first half looks reusable. I’ve edited it to remove the test-related functionality:

def get_object(name):
    """Retrieve a python object, given its dotted.name."""
    parts = name.split('.')
    parts_copy = parts[:]
    while parts_copy:
        try:
            module = __import__('.'.join(parts_copy))
            break
        except ImportError:
            del parts_copy[-1]
            if not parts_copy: raise
    parts = parts[1:]

    obj = module
    for part in parts:
        parent, obj = obj, getattr(obj, part)

    return obj
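
For readers on Python 3.9 or later, the standard library also ships a helper with much the same import-then-getattr fallback: pkgutil.resolve_name. A brief sketch, with arbitrary example names:

```python
import datetime
import os.path
import pkgutil

# Dotted form: the longest importable prefix is imported,
# and the remaining parts are resolved with getattr.
cls = pkgutil.resolve_name("datetime.datetime")
print(cls is datetime.datetime)

# "module:attribute" form states explicitly where the module name ends.
func = pkgutil.resolve_name("os.path:join")
print(func is os.path.join)
```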

Answer 6

I needed to get objects for all existing classes in my_package, so I import all the necessary classes into my_package’s __init__.py.

So my directory structure is like this:

/my_package
    - __init__.py
    - module1.py
    - module2.py
    - module3.py

And my __init__.py looks like this:

from .module1 import ClassA
from .module2 import ClassB

Then I create a function like this:

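import inspect

def get_classes_from_module_name(module_name):
    # Instantiate every class the module exposes; each one must accept a no-argument constructor.
    return [_cls() for _, _cls in inspect.getmembers(__import__(module_name), inspect.isclass)]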

Where module_name = 'my_package'

inspect doc: https://docs.python.org/3/library/inspect.html#inspect.getmembers
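
If you want the classes themselves rather than instances, and only those defined in the package rather than imported from elsewhere, one possible variation is sketched below. The fractions module stands in for my_package, and classes_defined_in is a hypothetical helper name.

```python
import importlib
import inspect

def classes_defined_in(module_name):
    """Return classes defined in the named module or package, skipping foreign imports."""
    module = importlib.import_module(module_name)
    prefix = module.__name__ + "."
    return [
        cls
        for _, cls in inspect.getmembers(module, inspect.isclass)
        # Keep classes from the module itself or from its submodules.
        if cls.__module__ == module.__name__ or cls.__module__.startswith(prefix)
    ]

names = [cls.__name__ for cls in classes_defined_in("fractions")]
print(names)
```

The submodule check matters for the layout above, because a class re-exported in __init__.py still reports my_package.module1 as its __module__.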


Why are dates calculated from January 1st, 1970?

Question: Why are dates calculated from January 1st, 1970?

Is there any reason behind using a date (January 1st, 1970) as the default standard for time manipulation? I have seen this standard in Java as well as in Python, the two languages I am aware of. Are there other popular languages which follow the same standard?

Please describe.


Answer 0

It is the standard of Unix time.

Unix time, or POSIX time, is a system for describing points in time, defined as the number of seconds elapsed since midnight proleptic Coordinated Universal Time (UTC) of January 1, 1970, not counting leap seconds.
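
As a quick check of that definition in Python: timestamp 0 maps to midnight UTC on 1 January 1970, and every later instant is a count of seconds from there.

```python
from datetime import datetime, timezone

epoch = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch.isoformat())  # 1970-01-01T00:00:00+00:00

# 86400 seconds (one day) after the epoch.
one_day_later = datetime.fromtimestamp(86400, tz=timezone.utc)
print(one_day_later.isoformat())  # 1970-01-02T00:00:00+00:00
```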


Answer 1

using date(January 1st, 1970) as default standard

The Question makes two false assumptions:

  • All time-tracking in computing is done as a count-since-1970.
  • Such tracking is standard.

Two Dozen Epochs

Time in computing is not always tracked from the beginning of 1970 UTC. While that epoch reference is popular, various computing environments over the decades have used nearly two dozen epochs. Some are from other centuries. They range from year 0 (zero) to 2001.

Here are a few.

January 0, 1 BC

January 1, AD 1

October 15, 1582

January 1, 1601

December 31, 1840

November 17, 1858

December 30, 1899

December 31, 1899

January 1, 1900

January 1, 1904

December 31, 1967

January 1, 1980

January 6, 1980

January 1, 2000

January 1, 2001

Unix Epoch Common, But Not Dominant

The beginning of 1970 is popular, probably because of its use by Unix. But by no means is that dominant. For example:

  • Countless millions (billions?) of Microsoft Excel & Lotus 1-2-3 documents use January 0, 1900 (December 31, 1899).
  • The world now has over a billion iOS/OS X devices using the Cocoa (NSDate) epoch of 1 January 2001, GMT.
  • The GPS satellite navigation system uses January 6, 1980 while the European alternative Galileo uses 22 August 1999.
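
Converting between such epochs is a matter of a fixed offset. The sketch below computes the offsets of the Cocoa and GPS epochs from the Unix epoch. It deliberately ignores leap seconds, which real GPS time handling must account for, and the names are my own.

```python
from datetime import datetime, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
COCOA_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)
GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)

cocoa_offset = int((COCOA_EPOCH - UNIX_EPOCH).total_seconds())
gps_offset = int((GPS_EPOCH - UNIX_EPOCH).total_seconds())
print(cocoa_offset)  # 978307200
print(gps_offset)    # 315964800

def cocoa_to_unix(cocoa_seconds):
    """Convert seconds since 2001-01-01 UTC to seconds since 1970-01-01 UTC."""
    return cocoa_seconds + cocoa_offset
```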

ISO 8601

Assuming a count-since-epoch is using the Unix epoch is opening a big vulnerability for bugs. Such a count is impossible for a human to instantly decipher, so errors or issues won’t be easily flagged when debugging and logging. Another problem is the ambiguity of granularity explained below.

I strongly suggest instead serializing date-time values as unambiguous ISO 8601 strings for data interchange rather than an integer count-since-epoch: YYYY-MM-DDTHH:MM:SS.SSSZ such as 2014-10-14T16:32:41.018Z.
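
In Python, for instance, producing and reading such a string might look like the following sketch, which uses the example instant from the text:

```python
from datetime import datetime, timezone

moment = datetime(2014, 10, 14, 16, 32, 41, 18000, tzinfo=timezone.utc)

# isoformat() writes UTC as "+00:00"; swap in the shorter "Z" suffix.
text = moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")
print(text)  # 2014-10-14T16:32:41.018Z

# Reverse the substitution before parsing, because fromisoformat()
# only accepts a trailing "Z" from Python 3.11 onwards.
parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
print(parsed == moment)  # True
```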

Count Of What Since Epoch

Another issue with count-since-epoch time tracking is the time unit, with at least four levels of resolution commonly used.
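
To illustrate the ambiguity: the same instant yields very different integers depending on the unit, and nothing in the number itself says which unit was used. The sketch uses integer timedelta division to avoid floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
moment = datetime(2014, 10, 14, 16, 32, 41, 18000, tzinfo=timezone.utc)
elapsed = moment - UNIX_EPOCH

seconds = elapsed // timedelta(seconds=1)
millis = elapsed // timedelta(milliseconds=1)
micros = elapsed // timedelta(microseconds=1)
print(seconds, millis, micros)

# Misreading the millisecond count as seconds lands
# tens of thousands of years in the future.
misread_years = millis / (365.25 * 24 * 3600)
print(round(misread_years))
```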


Answer 2

Why is it always 1st January 1970? Because 1st January 1970, usually called the “epoch date”, is the date when time started for Unix computers, and that timestamp is marked as 0. Any time since that date is calculated from the number of seconds elapsed. In simpler words, the timestamp of any date is the difference in seconds between that date and 1st January 1970. The timestamp is just an integer which started from 0 at midnight on 1st January 1970 and increments by 1 as each second passes. For converting Unix timestamps to readable dates, PHP and other open-source languages provide built-in functions.


Answer 3

Is there any reason behind using date(January 1st, 1970) as standard for time manipulation?

No reason that matters.

Python’s time module is the C library. Ask Ken Thompson why he chose that date for an epochal date. Maybe it was someone’s birthday.

Excel uses two different epochs. Any reason why different versions of Excel use different dates?

Except for the actual programmer, no one else will ever know why those kinds of decisions were made.

And…

It does not matter why the date was chosen. It just was.

Astronomers use their own epochal date: http://en.wikipedia.org/wiki/Epoch_(astronomy)

Why? A date has to be chosen to make the math work out. Any random date will work.

A date far in the past avoids negative numbers for the general case.

Some of the smarter packages use the proleptic Gregorian year 1. Any reason why year 1?
There’s a reason given in books like Calendrical Calculations: it’s mathematically slightly simpler.

But if you think about it, the difference between 1/1/1 and 1/1/1970 is just 1969, a trivial mathematical offset.


Answer 4

January 1st, 1970 00:00:00 am is the zero-point of POSIX time.


Answer 5

Q) “Why are dates calculated from January 1st, 1970?”

A) It had to be as recent as possible, yet include some past. There was most likely no significant other reason as a lot of people feel that same way.

They knew it posed a problem if they placed it too far into the past and they knew it gave negative results if it was in the future. There was no need to go deeper in the past as events will most likely take place in the future.

Notes: The Mayans, on the other hand, needed to place events in the past, since they had knowledge of a lot of the past, for which they made a long-term calendar, just to place all the routine phenomena on the calendar.

The timestamp was not meant as a calendar; it’s an epoch. And I believe the Mayans made their long-term calendar using that same perspective (meaning they knew damn well they didn’t have any relation to the past; they just needed to see it on a bigger scale).


Answer 6

Yes, C (and its family). This is where Java took it too.


WhatsApp API (java / python) [closed]

Question: WhatsApp API (java / python) [closed]

I am looking for a WhatsApp API, preferably a Python or Java library.

I’ve tried Yowsup, but could not get my number registered; I am based in India and I am not sure if that has got anything to do with it.

I did try WhatsAPI (Python library) but it is not working either.

Any suggestions about this? Any users of Yowsup here?


Answer 0

After trying everything, Yowsup library worked for me. The bug that I was facing was recently fixed. Anyone trying to do something with Whatsapp should try it.


Answer 1

From my blog

courtesy

There is a secret pilot program which WhatsApp is working on with selected businesses

News coverage:

For some of my technical experiments, I was trying to figure out how beneficial and feasible it is to implement bots for different chat platforms, in terms of market share and the likelihood of adoption. Especially when you have failed twice, to the point of bankruptcy, it’s important to validate ideas and fail faster.

Popular chat platforms like Messenger, Slack, Skype etc. have happily (in the sense officially) provided APIs for bots to interact with, but WhatsApp has not yet provided any API.

However, over the years a lot of activity has taken place around this, a struggle towards automated interaction with the WhatsApp platform:

  1. Bots App Bots App is interesting because it shows that something is really tried and tested.

  2. Yowsup A project still actively developed to interact with WhatsApp platform.

  3. Yallagenie Yallagenie claim that there is a demo bot which can be interacted with at +971 56 112 6652

  4. Hubtype Hubtype is working towards having a bot platform for WhatsApp for business.

  5. Fred Fred’s task was to automate WhatsApp conversations, however since it was not officially supported by WhatsApp – it was shut down.

  6. Oye Gennie A bot blocked by WhatsApp.

  7. App/Website to WhatsApp We can use custom URL schemes and Android intent system to interact with WhatsApp but still NOT WhatsApp API.

  8. Chat API daemon Probably created by inspecting the API calls in WhatsApp web version. NOT affiliated with WhatsApp.

  9. WhatsBot Deactivated WhatsApp bot. Created during a hackathon.

  10. No API claim WhatsApp co-founder clearly stated in a conference that they did not have any plans for APIs for WhatsApp.

  11. Bot Ware They probably are expecting WhatsApp to release their APIs for chat bot platforms.

  12. Vixi They seem to be talking about some platform which probably would work for WhatsApp. There is no clarity as such.

  13. Unofficial API This API can shut off any time.

    And the number goes on…


Answer 2

Yowsup provides the best solution, with examples. You can download the API from https://github.com/tgalal/yowsup. Let me know if you have any issues.


Answer 3

WhatsApp Inc. does not provide an open API, but a reverse-engineered library has been made available on GitHub by the team Venomous. To my knowledge, however, it is written in PHP. You can check the link here: https://github.com/venomous0x/WhatsAPI

Hope this helps


Answer 4

This is the developers page of the Open WhatsApp official page: http://openwhatsapp.org/develop/

You can find a lot of information there about Yowsup.

Or you can just go to the library’s link (which I copied from the Open WhatsApp page anyway): https://github.com/tgalal/yowsup

Enjoy!


How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8, and ASCII

Question: How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8, and ASCII

At work it seems like no week ever passes without some encoding-related conniption, calamity, or catastrophe. The problem usually derives from programmers who think they can reliably process a “text” file without specifying the encoding. But you can’t.

So it’s been decided to henceforth forbid files from ever having names that end in *.txt or *.text. The thinking is that those extensions mislead the casual programmer into a dull complacency regarding encodings, and this leads to improper handling. It would almost be better to have no extension at all, because at least then you know that you don’t know what you’ve got.

However, we aren’t going to go that far. Instead you will be expected to use a filename that ends in the encoding. So for text files, for example, these would be something like README.ascii, README.latin1, README.utf8, etc.

For files that demand a particular extension, if one can specify the encoding inside the file itself, such as in Perl or Python, then you shall do that. For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java.

For output, UTF-8 is to be strongly preferred.

But for input, we need to figure out how to deal with the thousands of files in our codebase named *.txt. We want to rename all of them to fit into our new standard. But we can’t possibly eyeball them all. So we need a library or program that actually works.

These are variously in ASCII, ISO-8859-1, UTF-8, Microsoft CP1252, or Apple MacRoman. Although we know we can tell if something is ASCII, and we stand a good chance of knowing if something is probably UTF-8, we’re stumped about the 8-bit encodings. Because we’re running in a mixed Unix environment (Solaris, Linux, Darwin) with most desktops being Macs, we have quite a few annoying MacRoman files. And these especially are a problem.

For some time now I’ve been looking for a way to programmatically determine which of

  1. ASCII
  2. ISO-8859-1
  3. CP1252
  4. MacRoman
  5. UTF-8

a file is in, and I haven’t found a program or library that can reliably distinguish between those three different 8-bit encodings. We probably have over a thousand MacRoman files alone, so whatever charset detector we use has to be able to sniff those out. Nothing I’ve looked at can manage the trick. I had big hopes for the ICU charset detector library, but it cannot handle MacRoman. I’ve also looked at modules to do the same sort of thing in both Perl and Python, but again and again it’s always the same story: no support for detecting MacRoman.

What I am therefore looking for is an existing library or program that reliably determines which of those five encodings a file is in, and preferably more than that. In particular it has to distinguish between the three 8-bit encodings I’ve cited, especially MacRoman. The files are more than 99% English language text; there are a few in other languages, but not many.

If it’s library code, our language preference is for it to be in Perl, C, Java, or Python, and in that order. If it’s just a program, then we don’t really care what language it’s in so long as it comes in full source, runs on Unix, and is fully unencumbered.

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you? This is the most important aspect of my question, but I’m also interested in whether you think encouraging programmers to name (or rename) their files with the actual encoding those files are in will help us avoid the problem in the future. Has anyone ever tried to enforce this on an institutional basis, and if so, was that successful or not, and why?

And yes, I fully understand why one cannot guarantee a definite answer given the nature of the problem. This is especially the case with small files, where you don’t have enough data to go on. Fortunately, our files are seldom small. Apart from the random README file, most are in the size range of 50k to 250k, and many are larger. Anything more than a few K in size is guaranteed to be in English.

The problem domain is biomedical text mining, so we sometimes deal with extensive and extremely large corpora, like all of PubMedCentral’s Open Access repository. A rather huge file is the BioThesaurus 6.0, at 5.7 gigabytes. This file is especially annoying because it is almost all UTF-8. However, some numbskull went and stuck a few lines in it that are in some 8-bit encoding (Microsoft CP1252, I believe). It takes quite a while before you trip on that one. :(


Answer 0

First, the easy cases:

ASCII

If your data contains no bytes above 0x7F, then it’s ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)

UTF-8

If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8’s strict validation rules, false positives are extremely rare.
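
Both easy cases take only a few lines in Python. A minimal sketch follows; the function names are illustrative.

```python
def is_ascii(data):
    """True if no byte is above 0x7F."""
    return all(byte < 0x80 for byte in data)

def is_utf8(data):
    """True if the bytes pass strict UTF-8 validation."""
    try:
        data.decode("utf-8")  # the default error handler is "strict"
        return True
    except UnicodeDecodeError:
        return False

print(is_ascii(b"plain text"))
print(is_utf8("café".encode("utf-8")))
print(is_utf8("café".encode("cp1252")))
```

Since every ASCII file also validates as UTF-8, run the ASCII test first if you want to tell the two apart.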

ISO-8859-1 vs. windows-1252

The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I’ve seen plenty of files that use curly quotes or dashes, but none that use C1 control characters. So don’t even bother with them, or ISO-8859-1, just detect windows-1252 instead.

That now leaves you with only one question.

How do you distinguish MacRoman from cp1252?

This is a lot trickier.

Undefined characters

The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.

Identical characters

The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn’t matter whether you choose MacRoman or cp1252.

Statistical approach

Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.

For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters are ·•–é°®’èö—. Based on this fact,

  • The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.
  • The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.

Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever is greatest.
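
The undefined-byte test and the vote count above can be sketched as follows. The hint bytes are derived from the ten characters listed rather than hard-coded. Breaking ties in favor of cp1252 is my own assumption, since the answer does not say what to do when the counts are equal.

```python
# Frequent non-ASCII characters from the answer's Wikipedia sample.
COMMON_CHARS = "·•–é°®’èö—"

# Derive each encoding's "suggesting" bytes from those characters.
CP1252_HINTS = {ch.encode("cp1252")[0] for ch in COMMON_CHARS}
MACROMAN_HINTS = {ch.encode("mac_roman")[0] for ch in COMMON_CHARS}

# Bytes with no assigned character in windows-1252.
CP1252_UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

def guess_cp1252_or_macroman(data):
    """Guess the encoding of bytes already known to be neither ASCII nor UTF-8."""
    if any(byte in CP1252_UNDEFINED for byte in data):
        return "mac_roman"
    cp1252_votes = sum(byte in CP1252_HINTS for byte in data)
    macroman_votes = sum(byte in MACROMAN_HINTS for byte in data)
    # Assumption: ties go to cp1252.
    return "cp1252" if cp1252_votes >= macroman_votes else "mac_roman"

sample = "It’s a café"
print(guess_cp1252_or_macroman(sample.encode("cp1252")))
print(guess_cp1252_or_macroman(sample.encode("mac_roman")))
```

For a real corpus you would replace COMMON_CHARS with the most frequent non-ASCII characters counted from your own known-UTF-8 files.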


Answer 1

Mozilla nsUniversalDetector (Perl bindings: Encode::Detect / Encode::Detect::Detector) is million-fold proven.


Answer 2

My attempt at such a heuristic (assuming that you’ve ruled out ASCII and UTF-8):

  • If 0x7f to 0x9f don’t appear at all, it’s probably ISO-8859-1, because those are very rarely used control codes.
  • If 0x91 through 0x94 appear a lot, it’s probably Windows-1252, because those are the “smart quotes”, by far the most likely characters in that range to be used in English text. To be more certain, you could look for pairs.
  • Otherwise, it’s MacRoman, especially if you see a lot of 0xd2 through 0xd5 (that’s where the typographic quotes are in MacRoman).

Side note:

For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java

Do not do this!!

The Java compiler expects file names to match class names, so renaming the files will render the source code uncompilable. The correct thing would be to guess the encoding, then use the native2ascii tool to convert all non-ASCII characters to Unicode escape sequences.


Answer 3

“Perl, C, Java, or Python, and in that order”: interesting attitude :-)

“we stand a good change of knowing if something is probably UTF-8”: Actually the chance that a file containing meaningful text encoded in some other charset that uses high-bit-set bytes will decode successfully as UTF-8 is vanishingly small.

UTF-8 strategies (in least preferred language):

# 100% Unicode-standard-compliant UTF-8
def utf8_strict(text):
    try:
        text.decode('utf8')
        return True
    except UnicodeDecodeError:
        return False

# looking for almost all UTF-8 with some junk
def utf8_replace(text):
    utext = text.decode('utf8', 'replace')
    dodgy_count = utext.count(u'\uFFFD') 
    return dodgy_count, utext
    # further action depends on how large dodgy_count / float(len(utext)) is

# checking for UTF-8 structure but non-compliant
# e.g. encoded surrogates, not minimal length, more than 4 bytes:
# Can be done with a regex, if you need it
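
For readers who would rather stay in Java, the first two strategies can be approximated with the standard java.nio.charset API. This is a sketch under assumptions: the class and method names are invented for the example, and the number of U+FFFD characters produced per malformed sequence is decided by the JDK decoder, so treat the count as a rough signal only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Strict check: true only if the whole input decodes as UTF-8 without errors. */
    public static boolean isStrictUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Lenient check: decode with replacement and count the U+FFFD characters in the result. */
    public static int countReplacements(byte[] data) {
        // new String(bytes, charset) substitutes U+FFFD for malformed input.
        String text = new String(data, StandardCharsets.UTF_8);
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == 0xFFFD) {
                count++;
            }
        }
        return count;
    }
}
```

As in the Python version, what to do next depends on how large the replacement count is relative to the length of the decoded text.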

Once you’ve decided that it’s neither ASCII nor UTF-8:

The Mozilla-origin charset detectors that I’m aware of don’t support MacRoman and in any case don’t do a good job on 8-bit charsets especially with English because AFAICT they depend on checking whether the decoding makes sense in the given language, ignoring the punctuation characters, and based on a wide selection of documents in that language.

As others have remarked, you really only have the high-bit-set punctuation characters available to distinguish between cp1252 and macroman. I’d suggest training a Mozilla-type model on your own documents, not Shakespeare or Hansard or the KJV Bible, and taking all 256 bytes into account. I presume that your files have no markup (HTML, XML, etc) in them — that would distort the probabilities something shocking.

You’ve mentioned files that are mostly UTF-8 but fail to decode. You should also be very suspicious of:

(1) files that are allegedly encoded in ISO-8859-1 but contain “control characters” in the range 0x80 to 0x9F inclusive … this is so prevalent that the draft HTML5 standard says to decode ALL HTML streams declared as ISO-8859-1 using cp1252.

(2) files that decode OK as UTF-8 but the resultant Unicode contains “control characters” in the range U+0080 to U+009F inclusive … this can result from transcoding cp1252 / cp850 (seen it happen!) / etc files from “ISO-8859-1” to UTF-8.
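
Both warning signs come down to finding C1 control characters (U+0080 to U+009F) in the decoded text. A small Java sketch of that check follows; the class and method names are made up for the example.

```java
public class C1ControlCheck {
    /** Counts characters in U+0080..U+009F, which rarely occur in genuine text. */
    public static int countC1Controls(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                count++;
            }
        }
        return count;
    }
}
```

For instance, decoding the cp1252 bytes 0x93 and 0x94 (curly quotes) as ISO-8859-1 yields U+0093 and U+0094, so a non-zero count on an "ISO-8859-1" file suggests retrying with windows-1252.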

Background: I have a wet-Sunday-afternoon project to create a Python-based charset detector that’s file-oriented (instead of web-oriented) and works well with 8-bit character sets including legacy ** n ones like cp850 and cp437. It’s nowhere near prime time yet. I’m interested in training files; are your ISO-8859-1 / cp1252 / MacRoman files as equally “unencumbered” as you expect anyone’s code solution to be?


Answer 4

As you have discovered, there is no perfect way to solve this problem, because without the implicit knowledge about which encoding a file uses, all 8-bit encodings are exactly the same: A collection of bytes. All bytes are valid for all 8-bit encodings.

The best you can hope for is some sort of algorithm that analyzes the bytes and, based on the probability of a certain byte being used in a certain language with a certain encoding, guesses which encoding the file uses. But that algorithm has to know which language the file uses, and it becomes completely useless when you have files with mixed encodings.

On the upside, if you know that the text in a file is written in English, then you’re unlikely to notice any difference whichever encoding you decide to use for that file, as the differences between all the mentioned encodings are localized in the parts of the encodings that specify characters not normally used in the English language. You might have some trouble where the text uses special formatting, or special versions of punctuation (CP1252 has several versions of the quote characters, for instance), but for the gist of the text there will probably be no problems.


Answer 5

If you can detect every encoding EXCEPT for macroman, then it would be logical to assume that the ones that can’t be deciphered are in macroman. In other words, just make a list of files that couldn’t be processed and handle those as if they were macroman.

Another way to sort these files would be to make a server based program that allows users to decide which encoding isn’t garbled. Of course, it would be within the company, but with 100 employees doing a few each day, you’ll have thousands of files done in no time.

Finally, wouldn’t it be better to just convert all existing files to a single format, and require that new files be in that format?


Answer 6

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you?

I am currently writing a program that translates files into XML. It has to autodetect the type of each file, which is a superset of the problem of determining the encoding of a text file. For determining the encoding I am using a Bayesian approach. That is, my classification code computes a probability (likelihood) that a text file has a particular encoding for all the encodings it understands. The program then selects the most probable decoder. The Bayesian approach works like this for each encoding.

  1. Set the initial (prior) probability that the file is in the encoding, based on the frequencies of each encoding.
  2. Examine each byte in turn in the file. Look-up the byte value to determine the correlation between that byte value being present and a file actually being in that encoding. Use that correlation to compute a new (posterior) probability that the file is in the encoding. If you have more bytes to examine, use the posterior probability of that byte as the prior probability when you examine the next byte.
  3. When you get to the end of the file (I actually look at only the first 1024 bytes), the probability you have is the probability that the file is in the encoding.

It transpires that Bayes’ theorem becomes very easy to do if instead of computing probabilities, you compute information content, which is the logarithm of the odds: info = log(p / (1.0 - p)).

You will have to compute the initial prior probabilities, and the correlations, by examining a corpus of files that you have manually classified.
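
One way to read the steps above is as a naive-Bayes update in log-odds form: start from info(prior) and add one term per byte. The Java sketch below takes that reading. It is not the author's code: the class name, the method names, and the interpretation of "correlation" as a per-byte log likelihood ratio (with bytes treated as independent) are all assumptions, and the 256-entry table would have to be estimated from your own hand-classified corpus.

```java
public class LogOddsEncodingScore {
    /** Information content as defined above: the logarithm of the odds. */
    public static double info(double p) {
        return Math.log(p / (1.0 - p));
    }

    /** Inverse of info(): converts log-odds back into a probability. */
    public static double probability(double info) {
        return 1.0 / (1.0 + Math.exp(-info));
    }

    /**
     * @param data               file contents
     * @param prior              prior probability that a file is in this encoding
     * @param logLikelihoodRatio hypothetical 256-entry table of
     *                           log(P(byte | encoding) / P(byte | other encodings))
     * @return posterior probability that the file is in this encoding
     */
    public static double posterior(byte[] data, double prior, double[] logLikelihoodRatio) {
        double score = info(prior);
        int n = Math.min(data.length, 1024); // the answer looks at only the first 1024 bytes
        for (int i = 0; i < n; i++) {
            // In log-odds form each Bayes update is a single addition.
            score += logLikelihoodRatio[data[i] & 0xFF];
        }
        return probability(score);
    }
}
```

Running posterior() once per candidate encoding and picking the highest value corresponds to "selects the most probable decoder".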


Java: equivalent to Python’s range(int, int)?

Question: Java: equivalent to Python’s range(int, int)?

Does Java have an equivalent to Python’s range(int, int) method?


Answer 0

Guava also provides something similar to Python’s range:

Range.closed(1, 5).asSet(DiscreteDomains.integers());

You can also implement a fairly simple iterator to do the same sort of thing using Guava’s AbstractIterator:

return new AbstractIterator<Integer>() {
  int next = getStart();

  @Override protected Integer computeNext() {
    if (isBeyondEnd(next)) {
      return endOfData();
    }
    Integer result = next;
    next = next + getStep();
    return result;
  }
};

Answer 1

Old question, new answer (for Java 8)

    IntStream.range(0, 10).forEach(
        n -> {
            System.out.println(n);
        }
    );

or with method references:

IntStream.range(0, 10).forEach(System.out::println);

Answer 2

Since Guava 15.0, Range.asSet() has been deprecated and is scheduled to be removed in version 16. Use the following instead:

ContiguousSet.create(Range.closed(1, 5), DiscreteDomain.integers());

Answer 3

I’m working on a little Java utils library called Jools, and it contains a class Range which provides the functionality you need (there’s a downloadable JAR).
Constructors are either Range(int stop), Range(int start, int stop), or Range(int start, int stop, int step) (similar to a for loop) and you can either iterate through it, which uses lazy evaluation, or you can use its toList() method to explicitly get the range list.

for (int i : new Range(10)) {...} // i = 0,1,2,3,4,5,6,7,8,9

for (int i : new Range(4,10)) {...} // i = 4,5,6,7,8,9

for (int i : new Range(0,10,2)) {...} // i = 0,2,4,6,8

Range range = new Range(0,10,2);
range.toList(); // [0,2,4,6,8]

Answer 4

public int[] range(int start, int stop)
{
   int[] result = new int[stop-start];

   for(int i=0;i<stop-start;i++)
      result[i] = start+i;

   return result;
}

Forgive any syntax or style errors; I normally program in C#.


Answer 5

You can use the following code snippet in order to get a range set of integers:

    Set<Integer> iset = IntStream.rangeClosed(1, 5).boxed().collect
            (Collectors.toSet());

Answer 6

public int[] range(int start, int length) {
    int[] range = new int[length - start + 1];
    for (int i = start; i <= length; i++) {
        range[i - start] = i;
    }
    return range;
}

(Long answer just to say “No”)


Answer 7

Java 9 – IntStream::iterate

Since Java 9 you can use IntStream::iterate and you can even customize the step. For example if you want int array :

public static int[] getInRange(final int min, final int max, final int step) {
    return IntStream.iterate(min, i -> i < max, i -> i + step)
            .toArray();
}

or List :

public static List<Integer> getInRange(final int min, final int max, final int step) {
    return IntStream.iterate(min, i -> i < max, i -> i + step)
            .boxed()
            .collect(Collectors.toList());
}

And then use it :

int[] range = getInRange(0, 10, 1);
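
Note that the `i -> i < max` condition only terminates sensibly for positive steps. If Python-like behaviour for negative steps is wanted, a variant along these lines might work; the class name PyRange is hypothetical, and int overflow near Integer.MAX_VALUE or Integer.MIN_VALUE is not handled in this sketch.

```java
import java.util.stream.IntStream;

public class PyRange {
    /** Mimics Python's range(start, stop, step): stop is exclusive, step may be negative. */
    public static int[] range(int start, int stop, int step) {
        if (step == 0) {
            throw new IllegalArgumentException("step must not be zero");
        }
        return IntStream
                // Three-argument iterate (seed, hasNext, next) is available since Java 9.
                .iterate(start, i -> step > 0 ? i < stop : i > stop, i -> i + step)
                .toArray();
    }
}
```

With this variant, PyRange.range(5, 0, -2) counts down from 5 and stops before reaching 0.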

Answer 8

Groovy’s nifty Range class can be used from Java, though it’s certainly not as groovy.


Answer 9

The “Functional Java” library allows you to program this way to a limited degree; it has a range() method that creates an fj.data.Array instance.

See:

Similarly the “Totally Lazy” library offers a lazy range method: http://code.google.com/p/totallylazy/


Answer 10

If you mean to use it like you would in a Python loop, Java loops nicely with the for statement, which renders this structure unnecessary for that purpose.


Answer 11

IntStream.range(0, 10).boxed().collect(Collectors.toUnmodifiableList());

Answer 12

I know this is an old post but if you are looking for a solution that returns an object stream and don’t want to or can’t use any additional dependencies:

Stream.iterate(start, n -> n + 1).limit(stop - start);

start – inclusive, stop – exclusive (limit() takes an element count, not an end value, hence stop - start)


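Antlr4 - ANTLR is a powerful parser generator for reading, processing, executing, or translating structured text or binary files

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It is widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build parse trees, and also generates a listener interface (or visitor) that makes it easy to respond to the recognition of phrases of interest.

Given day-job constraints, my time working on this project is limited, so I have to focus first on fixing bugs rather than changing or improving the feature set. Most likely I will do it in bursts every few months. Please do not be offended if your bug or pull request does not yield a response! –parrt

Authors and major contributors

Useful information

You might also find the following pages useful, particularly if you want to work with the various target languages.

The Definitive ANTLR 4 Reference

Programmers run into parsing problems all the time. Whether it’s a data format like JSON, a network protocol like SMTP, a server configuration file for Apache, a PostScript/PDF file, or a simple spreadsheet macro language, ANTLR v4 and this book will demystify the process. ANTLR v4 has been rewritten from scratch to make it easier than ever to build parsers and the language applications built on top of them. This completely rewritten new edition of the bestselling Definitive ANTLR Reference shows you how to take advantage of these new features.

You can buy the book The Definitive ANTLR 4 Reference at Amazon or an electronic version at the publisher’s site.

You will find the Book source code useful.

Additional grammars

This repository is a collection of grammars without actions, where the root directory name is the all-lowercase name of the language parsed by the grammar. For example, java, cpp, cSharp, c, etc.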

Choosing Java vs Python on Google App Engine

Question: Choosing Java vs Python on Google App Engine

Currently Google App Engine supports both Python & Java. Java support is less mature. However, Java seems to have a longer list of libraries and especially support for Java bytecode regardless of the languages used to write that code. Which language will give better performance and more power? Please advise. Thank you!

Edit: http://groups.google.com/group/google-appengine-java/web/will-it-play-in-app-engine?pli=1

Edit: By “power” I mean better expandability and inclusion of available libraries outside the framework. Python allows only pure Python libraries, though.


Answer 0

I’m biased (being a Python expert but pretty rusty in Java) but I think the Python runtime of GAE is currently more advanced and better developed than the Java runtime — the former has had one extra year to develop and mature, after all.

How things will proceed going forward is of course hard to predict — demand is probably stronger on the Java side (especially since it’s not just about Java, but other languages perched on top of the JVM too, so it’s THE way to run e.g. PHP or Ruby code on App Engine); the Python App Engine team however does have the advantage of having on board Guido van Rossum, the inventor of Python and an amazingly strong engineer.

In terms of flexibility, the Java engine, as already mentioned, does offer the possibility of running JVM bytecode made by different languages, not just Java — if you’re in a multi-language shop that’s a pretty large positive. Vice versa, if you loathe Javascript but must execute some code in the user’s browser, Java’s GWT (generating the Javascript for you from your Java-level coding) is far richer and more advanced than Python-side alternatives (in practice, if you choose Python, you’ll be writing some JS yourself for this purpose, while if you choose Java GWT is a usable alternative if you loathe writing JS).

In terms of libraries it’s pretty much a wash: the JVM is restricted enough (no threads, no custom class loaders, no JNI, no relational DB) to hamper the simple reuse of existing Java libraries as much as, or more than, existing Python libraries are hampered by the similar restrictions on the Python runtime.

In terms of performance, I think it’s a wash, though you should benchmark on tasks of your own; don’t rely on the performance of highly optimized JIT-based JVM implementations while discounting their large startup times and memory footprints, because the app engine environment is very different (startup costs will be paid often, as instances of your app are started, stopped, moved to different hosts, etc, all transparently to you; such events are typically much cheaper with Python runtime environments than with JVMs).

The XPath/XSLT situation (to be euphemistic…) is not exactly perfect on either side, sigh, though I think it may be a tad less bad in the JVM (where, apparently, substantial subsets of Saxon can be made to run, with some care). I think it’s worth opening issues on the Appengine Issues page with XPath and XSLT in their titles — right now there are only issues asking for specific libraries, and that’s myopic: I don’t really care HOW a good XPath/XSLT is implemented, for Python and/or for Java, as long as I get to use it. (Specific libraries may ease migration of existing code, but that’s less important than being able to perform such tasks as “rapidly apply XSLT transformation” in SOME way!-). I know I’d star such an issue if well phrased (especially in a language-independent way).

Last but not least: remember that you can have different versions of your app (using the same datastore), some of which are implemented with the Python runtime and some with the Java runtime, and you can access versions that differ from the “default/active” one with explicit URLs. So you could have both Python and Java code (in different versions of your app) use and modify the same data store, granting you even more flexibility (though only one will have the “nice” URL such as foobar.appspot.com, which is probably important only for access by interactive users on browsers, I imagine;-).


Answer 1

Watch this app for changes in Python and Java performance:

http://gaejava.appspot.com/ (edit: apologies, link is broken now. But following para still applied when I saw it running last)

Currently, Python and using the low-level API in Java are faster than JDO on Java, for this simple test. At least if the underlying engine changes, that app should reflect performance changes.


Answer 2

Based on experience with running these VMs on other platforms, I’d say that you’ll probably get more raw performance out of Java than Python. Don’t underestimate Python’s selling points, however: The Python language is much more productive in terms of lines of code – the general agreement is that Python requires a third of the code of an equivalent Java program, while remaining as or more readable. This benefit is multiplied by the ability to run code immediately without an explicit compile step.

With regards to available libraries, you’ll find that much of the extensive Python runtime library works out of the box (as does Java’s). The popular Django Web framework (http://www.djangoproject.com/) is also supported on AppEngine.

With regards to ‘power’, it’s difficult to know what you mean, but Python is used in many different domains, especially the Web: YouTube is written in Python, as is Sourceforge (as of last week).


Answer 3

June 2013: This video is a very good answer by a google engineer:

http://www.youtube.com/watch?v=tLriM2krw2E

TLDR; is:

  • Pick the language that you and your team are most productive with
  • If you want to build something for production: Java or Python (not Go)
  • If you have a big team and a complex code base: Java (because of static code analysis and refactoring)
  • Small teams that iterate quickly: Python (although Java is also okay)

Answer 4

An important question to consider in deciding between Python and Java is how you will use the datastore in each language (and most other angles to the original question have already been covered quite well in this topic).

For Java, the standard method is to use JDO or JPA. These are great for portability but are not very well suited to the datastore.

A low-level API is available but this is too low level for day-to-day use – it is more suitable for building 3rd party libraries.

For Python there is an API designed specifically to provide applications with easy but powerful access to the datastore. It is great except that it is not portable so it locks you into GAE.

Fortunately, there are solutions being developed for the weaknesses listed for both languages.

For Java, the low-level API is being used to develop persistence libraries that are much better suited to the datastore than JDO/JPA (IMO). Examples include the Siena project, and Objectify.

I’ve recently started using Objectify and am finding it to be very easy to use and well suited to the datastore, and its growing popularity has translated into good support. For example, Objectify is officially supported by Google’s new Cloud Endpoints service. On the other hand, Objectify only works with the datastore, while Siena is ‘inspired’ by the datastore but is designed to work with a variety of both SQL databases and NoSQL datastores.

For Python, there are efforts being made to allow the use of the Python GAE datastore API off of the GAE. One example is the SQLite backend that Google released for use with the SDK, but I doubt they intend this to grow into something production ready. The TyphoonAE project probably has more potential, but I don’t think it is production ready yet either (correct me if I am wrong).

If anyone has experience with any of these alternatives or knows of others, please add them in a comment. Personally, I really like the GAE datastore – I find it to be a considerable improvement over the AWS SimpleDB – so I wish for the success of these efforts to alleviate some of the issues in using it.


Answer 5

I’m strongly recommending Java for GAE and here’s why:

  1. Performance: Java is potentially faster than Python.
  2. Python development suffers from a lack of third-party libraries. For example, there is no XSLT for Python/GAE at all. Almost all Python libraries are C bindings (and those are unsupported by GAE).
  3. Memcache API: the Java SDK has more interesting abilities than the Python SDK.
  4. Datastore API: JDO is very slow, but the native Java datastore API is very fast and easy.

I’m using Java/GAE in development right now.


Answer 6

As you’ve identified, using a JVM doesn’t restrict you to using the Java language. A list of JVM languages and links can be found here. However, the Google App Engine does restrict the set of classes you can use from the normal Java SE set, and you will want to investigate if any of these implementations can be used on the app engine.

EDIT: I see you’ve found such a list

I can’t comment on the performance of Python. However, the JVM is a very powerful platform performance-wise, given its ability to dynamically compile and optimise code during the run time.

Ultimately performance will depend on what your application does, and how you code it. In the absence of further info, I think it’s not possible to give any more pointers in this area.


Answer 7

I’ve been amazed at how clean, straightforward, and problem free the Python/Django SDK is. However I started running into situations where I needed to start doing more JavaScript and thought I might want to take advantage of the GWT and other Java utilities. I’ve gotten just half way through the GAE Java tutorial, and have had one problem after another: Eclipse configuration issues, JRE versionitis, the mind-numbing complexity of Java, and a confusing and possibly broken tutorial. Checking out this site and others linked from here clinched it for me. I’m going back to Python, and I’ll look into Pyjamas to help with my JavaScript challenges.


Answer 8

I’m a little late to the conversation, but here are my two cents. I really had a hard time choosing between Python and Java, since I am well versed in both languages. As we all know, there are advantages and disadvantages for both, and you have to take into account your requirements and the frameworks that work best for your project.

As I usually do in this type of dilemma, I look for numbers to support my decision. I decided to go with Python for many reasons, but in my case, there was one plot that was the tipping point. If you search “Google App Engine” in GitHub as of September 2014, you will find the following figure:

There could be many biases in these numbers, but overall, there are three times more GAE Python repositories than GAE Java repositories. Not only that, but if you list the projects by the “number of stars” you will see that a majority of the Python projects appear at the top (you have to take into account that Python has been around longer). To me, this makes a strong case for Python because I take into account community adoption & support, documentation, and availability of open-source projects.


Answer 9

It’s a good question, and I think many of the responses have given good viewpoints on the pros and cons on both sides of the fence. I’ve tried both Python and JVM-based AppEngine (in my case I was using Gaelyk, which is a Groovy application framework built for AppEngine). When it comes to performance on the platform, one thing I hadn’t considered until it was staring me in the face is the implication of the “Loading Requests” that occur on the Java side of the fence. When using Groovy, these loading requests are a killer.

I put a post together on the topic (http://distractable.net/coding/google-appengine-java-vs-python-performance-comparison/) and I’m hoping to find a way of working around the problem, but if not, I think I’ll be going back to a Python + Django combination until cold-starting Java requests has less of an impact.


Answer 10

Based on how much more I hear Java people complain about AppEngine than Python users do, I would say Python is much less stressful to use.


Answer 11

There’s also the Unladen Swallow project, which is apparently Google-funded if not Google-owned. They’re trying to implement an LLVM-based backend for Python 2.6.1 bytecode, so they can use a JIT and various nice native code/GC/multi-core optimisations. (Nice quote: “We aspire to do no original work, instead using as much of the last 30 years of research as possible.”) They’re looking for a 5x speed-up over CPython.

Of course, this doesn’t answer your immediate question, but it points towards a “closing of the gap” (if any) in the future (hopefully).


Answer 12

The beauty of Python nowadays is how well it communicates with other languages. For instance, you can have both Python and Java on the same table with Jython. Of course, even though Jython fully supports Java libraries, it does not fully support Python libraries. But it’s an ideal solution if you want to work with Java libraries. It even allows you to mix Python with Java code with no extra coding.

But even Python itself has made some steps forward. See ctypes, for example: near-C speed and direct access to C libraries, all without leaving the comfort of Python coding. Cython goes one step further, allowing you to mix C code with Python code with ease; or, if you don’t want to mess with C or C++, you can still code in Python but use statically typed variables, making your Python programs as fast as C apps. Cython is both used and supported by Google, by the way.
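
As a rough sketch of the ctypes point above: the snippet below calls the C library’s `strlen` directly from Python. It assumes a POSIX-like system where the C standard library can be loaded this way (it will not work unchanged on Windows), and the wrapper name `c_strlen` is made up for illustration.

```python
import ctypes
import ctypes.util

# Locate the C standard library. On some systems find_library returns None;
# on POSIX, CDLL(None) then falls back to the symbols already loaded into
# the running process, which include libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Hypothetical helper: return the length of a byte string via C's strlen."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"hello"))
```

Declaring `argtypes` and `restype` is optional for simple calls, but it lets ctypes check arguments and convert the result correctly instead of assuming a C `int`.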

Yesterday I even found tools for Python to inline C or even assembly (see CorePy); you can’t get any more powerful than that.

Python is surely a very mature language, not only standing on its own but also able to cooperate with any other language with ease. I think that is what makes Python an ideal solution even in very advanced and demanding scenarios.

With Python you can have access to C/C++, Java, .NET and many other libraries with almost zero additional coding, while also getting a language that minimises, simplifies and beautifies coding. It’s a very tempting language.


Answer 13

I’ve gone with Python even though GWT seems a perfect match for the kind of app I’m developing. JPA is pretty messed up on GAE (e.g. no @Embeddable and other obscure undocumented limitations). Having spent a week on it, I can tell that Java just doesn’t feel right on GAE at the moment.


Answer 14

One thing to take into account is the frameworks you intend to use. Not all frameworks on the Java side are well suited to applications running on App Engine, which is somewhat different from traditional Java app servers.

One thing to consider is application startup time. With traditional Java web apps you don’t really need to think about this. The application starts and then it just runs. It doesn’t really matter if startup takes 5 seconds or a couple of minutes. With App Engine you might end up in a situation where the application is only started when a request comes in. This means the user is waiting while your application boots up. New GAE features like reserved instances help here, but check first.
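
To make the startup-time point concrete, here is a toy simulation, not App Engine code: an application that boots lazily on its first request, so that request alone pays the startup cost. The class `LazyApp` and the 0.2 second boot time are invented for illustration.

```python
import time


class LazyApp:
    """Toy stand-in for an instance that only boots when a request arrives."""

    def __init__(self, startup_seconds):
        self.startup_seconds = startup_seconds  # hypothetical boot cost
        self.started = False

    def handle(self, path):
        """Serve one request and return (body, seconds the caller waited)."""
        begin = time.perf_counter()
        if not self.started:
            # Simulate framework/runtime initialisation on the first request.
            time.sleep(self.startup_seconds)
            self.started = True
        body = "OK " + path
        return body, time.perf_counter() - begin


app = LazyApp(startup_seconds=0.2)
_, cold = app.handle("/")  # first request includes the boot time
_, warm = app.handle("/")  # later requests skip it

if __name__ == "__main__":
    print("cold: %.3fs, warm: %.6fs" % (cold, warm))
```

The heavier the framework’s initialisation, the larger the gap between the first and subsequent requests, which is why startup time matters more here than on a long-running app server.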

Another thing is the set of limitations GAE poses on Java. Not all frameworks are happy with the restrictions on which classes you can use, or the fact that threads are not allowed, or that you can’t access the local filesystem. These issues are probably easy to find out by just googling for GAE compatibility.

I’ve also seen some people complaining about issues with session size on modern UI frameworks (Wicket, namely). In general these frameworks tend to make certain trade-offs in order to make development fun, fast and easy. Sometimes this may lead to conflicts with the App Engine limitations.

I initially started developing on GAE with Java, but then switched to Python for these reasons. My personal feeling is that Python is a better choice for App Engine development. I think Java is more “at home” on, for example, Amazon’s Elastic Beanstalk.

BUT with App Engine, things are changing very rapidly. GAE itself is changing, and as it becomes more popular, the frameworks are also changing to work around its limitations.