What blocks Ruby and Python from getting JavaScript V8 speed?
Nothing.
Well, okay: money. (And time, people, resources, but if you have money, you can buy those.)
V8 has a team of brilliant, highly-specialized, highly-experienced (and thus highly-paid) engineers working on it, that have decades of experience (I’m talking individually – collectively it’s more like centuries) in creating high-performance execution engines for dynamic OO languages. They are basically the same people who also created the Sun HotSpot JVM (among many others).
Lars Bak, the lead developer, has been literally working on VMs for 25 years (and all of those VMs have led up to V8), which is basically his entire (professional) life. Some of the people writing Ruby VMs aren’t even 25 years old.
Are there any Ruby / Python features that are blocking implementation of optimizations (e.g. inline caching) V8 engine has?
Given that at least IronRuby, JRuby, MagLev, MacRuby and Rubinius have either monomorphic (IronRuby) or polymorphic inline caching, the answer is obviously no.
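For readers unfamiliar with the term, here is a toy sketch of what a monomorphic inline cache does at a single call site. It is an illustration in plain Python, not how any of the VMs named above implement it; the `CallSite` class and the two shape classes are hypothetical names made up for the example.

```python
class CallSite:
    """Toy monomorphic inline cache for one method-call site (illustrative only)."""

    def __init__(self, method_name):
        self.method_name = method_name
        self.cached_class = None   # receiver class seen on the previous call
        self.cached_func = None    # method that was resolved for that class
        self.hits = 0
        self.misses = 0

    def call(self, receiver, *args):
        cls = type(receiver)
        if cls is self.cached_class:
            # Fast path: same receiver class as last time, skip the lookup.
            self.hits += 1
        else:
            # Slow path: do the full lookup and remember the result.
            self.misses += 1
            self.cached_class = cls
            self.cached_func = getattr(cls, self.method_name)
        return self.cached_func(receiver, *args)


class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3 * self.r * self.r  # crude pi keeps the arithmetic exact


class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


site = CallSite("area")
shapes = [Circle(1), Circle(2), Circle(3), Square(2)]
areas = [site.call(shape) for shape in shapes]
```

A polymorphic cache would keep a small table of (class, method) pairs instead of a single entry, and a real VM also has to invalidate the cache when a class is modified, which this sketch ignores.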
Modern Ruby implementations already do a great deal of optimizations. For example, for certain operations, Rubinius’s Hash class is faster than YARV’s. Now, this doesn’t sound terribly exciting until you realize that Rubinius’s Hash class is implemented in 100% pure Ruby, while YARV’s is implemented in 100% hand-optimized C.
So, at least in some cases, Rubinius can generate better code than GCC!
Or is this rather a matter of resources put into the V8 project by Google?
Yes. Not just Google. The lineage of V8’s source code is 25 years old now. The people who are working on V8 also created the Self VM (to this day one of the fastest dynamic OO language execution engines ever created), the Animorphic Smalltalk VM (to this day one of the fastest Smalltalk execution engines ever created), the HotSpot JVM (the fastest JVM ever created, probably the fastest VM period) and OOVM (one of the most efficient Smalltalk VMs ever created).
In fact, Lars Bak, the lead developer of V8, worked on every single one of those, plus a few others.
There’s a lot more impetus to highly optimize JavaScript interpreters, which is why we see so many resources being put into them between Mozilla, Google, and Microsoft. JavaScript has to be downloaded, parsed, compiled, and run in real time while a (usually impatient) human being is waiting for it, it has to run WHILE a person is interacting with it, and it’s doing this in an uncontrolled client-side environment that could be a computer, a phone, or a toaster. It HAS to be efficient in order to run under these conditions effectively.
Python and Ruby are run in an environment controlled by the developer/deployer. A beefy server or desktop system generally where the limiting factor will be things like memory or disk I/O and not execution time. Or where non-engine optimizations like caching can be utilized. For these languages it probably does make more sense to focus on language and library feature set over speed optimization.
The side benefit of this is that we have two great high-performance open source JavaScript engines that can be, and are being, repurposed for all manner of applications such as Node.js.
A good part of it has to do with community. Python and Ruby for the most part have no corporate backing. No one gets paid to work on Python and Ruby full-time (and they especially don’t get paid to work on CPython or MRI the whole time). V8, on the other hand, is backed by the most powerful IT company in the world.
Furthermore, V8 can be faster because the only thing that matters to the V8 people is the interpreter — they have no standard library to work on, no concerns about language design. They just write the interpreter. That’s it.
It has nothing to do with intellectual property law. Nor is Python co-developed by Google guys (its creator works there along with a few other committers, but they don’t get paid to work on Python).
Another obstacle to Python speed is Python 3. Its adoption seems to be the main concern of the language developers — to the point that they have frozen development of new language features until other implementations catch up.
On to the technical details, I don’t know much about Ruby, but Python has a number of places where optimizations could be used (and Unladen Swallow, a Google project, started to implement these before biting the dust). Here are some of the optimizations that they planned. I could see Python gaining V8 speed in the future if a JIT a la PyPy gets implemented for CPython, but that does not seem likely for the coming years (the focus right now is Python 3 adoption, not a JIT).
Many also feel that Ruby and Python could benefit immensely from removing their respective global interpreter locks.
You also have to understand that Python and Ruby are both much heavier languages than JS — they provide far more in the way of standard library, language features, and structure. The class system of object-orientation alone adds a great deal of weight (in a good way, I think). I almost think of Javascript as a language designed to be embedded, like Lua (and in many ways, they are similar). Ruby and Python have a much richer set of features, and that expressiveness is usually going to come at the cost of speed.
Performance doesn’t seem to be a major focus of the core Python developers, who seem to feel that “fast enough” is good enough, and that features that help programmers be more productive are more important than features that help computers run code faster.
Indeed, however, there was a (now abandoned) Google project, unladen-swallow, to produce a faster Python interpreter compatible with the standard interpreter. PyPy is another project that intends to produce a faster Python. There is also Psyco, the forerunner of PyPy, which can provide performance boosts to many Python scripts without changing out the whole interpreter, and Cython, which lets you write high-performance C libraries for Python using something very much like Python syntax.
Misleading question. V8 is a JIT (a just in time compiler) implementation of JavaScript and in its most popular non-browser implementation Node.js it is constructed around an event loop. CPython is not a JIT & not evented. But these exist in Python most commonly in the PyPy project – a CPython 2.7 (and soon to be 3.0+) compatible JIT. And there are loads of evented server libraries like Tornado for example. Real world tests exist between PyPy running Tornado vs Node.js and the performance differences are slight.
I just ran across this question and there is also a big technical reason for the performance difference that wasn’t mentioned. Python has a very large ecosystem of powerful software extensions, but most of these extensions are written in C or other low-level languages for performance and are heavily tied to the CPython API.
There are lots of well-known techniques (JIT, modern garbage collector, etc) that could be used to speed up the CPython implementation but all would require substantial changes to the API, breaking most of the extensions in the process. CPython would be faster, but a lot of what makes Python so attractive (the extensive software stack) would be lost. Case in point, there are several faster Python implementations out there but they have little traction compared to CPython.
Because of different design priorities and use-case goals, I believe.
In general, the main purpose of scripting (a.k.a. dynamic) languages is to be a “glue” between calls of native functions. And these native functions should a) cover the most critical/frequently used areas and b) be as effective as possible.
Here is an example:
jQuery sort causing iOS Safari to freeze
The freeze there is caused by excessive use of get-by-selector calls. If get-by-selector were implemented efficiently in native code, there would be no such problem at all.
Consider the ray-tracer demo that is frequently used to demonstrate V8. In the Python world it can be implemented in native code, as Python provides all the facilities for native extensions. But in the V8 realm (a client-side sandbox) you have no option other than making the VM as effective as possible. And so the only way to implement a ray tracer there is in script code.
So different priorities and motivations.
In Sciter I’ve made a test by implementing pretty much the full jQuery core natively. On practical tasks like ScIDE (an IDE made of HTML/CSS/Script) I believe such a solution works significantly better than any VM optimizations.
As other people have mentioned, Python has a performant JIT compiler in the form of PyPy.
Making meaningful benchmarks is always subtle, but I happen to have a simple benchmark of K-means written in different languages – you can find it here. One of the constraints was that the various languages should all implement the same algorithm and should strive to be simple and idiomatic (as opposed to optimized for speed). I have written all the implementations, so I know I have not cheated, although I cannot claim for all languages that what I have written is idiomatic (I only have a passing knowledge of some of those).
I do not claim any definitive conclusion, but PyPy was among the fastest implementations I got, far better than Node. CPython, instead, was at the slowest end of the ranking.
Also, there is the problem of perceived performance: since V8 is natively non-blocking, web development with it leads to more performant projects because you save the I/O wait. And V8 is mainly used for web development, where I/O is key, so people compare it to similar projects. But you can use Python in many, many other areas than web development. And you can even use C extensions for a lot of tasks, such as scientific computations or encryption, and crunch data with blazing performance.
But on the web, most popular Python and Ruby projects are blocking. Python, especially, has the legacy of the synchronous WSGI standard, and frameworks like the famous Django are based on it.
You can write asynchronous Python (like with Twisted, Tornado, gevent or asyncio) or Ruby. But it’s not done often. The best tools are still blocking.
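To make the "save the I/O wait" point concrete, here is a minimal asyncio sketch. The `fake_io` coroutine and its names are hypothetical; `asyncio.sleep` merely stands in for a network or disk wait, so this shows the scheduling idea rather than a real server.

```python
import asyncio
import time


async def fake_io(name, delay):
    # asyncio.sleep stands in for a network or disk wait (hypothetical workload).
    await asyncio.sleep(delay)
    return name


async def main():
    start = time.perf_counter()
    # The three waits overlap, so the total is close to the longest single wait
    # rather than the sum of all three.
    results = await asyncio.gather(
        fake_io("db", 0.1),
        fake_io("cache", 0.1),
        fake_io("api", 0.1),
    )
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
```

A blocking version of the same three calls would take roughly the sum of the delays, which is the difference a synchronous WSGI-style handler pays on every request.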
However, there are some reasons why the default implementations in Ruby and Python are not as speedy as V8.
Experience
Like Jörg W Mittag pointed out, the guys working on V8 are VM geniuses. Python is developed by a bunch of passionate people who are very good in a lot of domains but are not as specialized in VM tuning.
Resources
The Python Software Foundation has very little money: less than 40k a year to invest in Python. This is kinda crazy when you think that big players such as Google, Facebook or Apple are all using Python, but it’s the ugly truth: most work is done for free. The language that powers YouTube and existed before Java has been handcrafted by volunteers.
They are smart and dedicated volunteers, but when they identify they need more juice in a field, they can’t ask for 300k to hire a top notch specialist for this area of expertise. They have to look around for somebody who would do it for free.
While this works, it means you have to be very careful about your priorities. Hence, now we need to look at:
Objectives
Even with the latest modern features, writing JavaScript is terrible. You have scoping issues, very few collections, terrible string and array manipulation, almost no stdlib apart from date, maths and regexes, and no syntactic sugar even for very common operations.
But in V8, you’ve got speed.
This is because speed was the main objective for Google, since it’s a bottleneck for page rendering in Chrome.
In Python, usability is the main objective, because execution speed is almost never the bottleneck on the project. The scarce resource here is developer time. It’s optimized for the developer.
Because JavaScript implementations need not care about backwards compatibility of their bindings.
Until recently the only users of the JavaScript implementations have been web browsers. Due to security requirements, only the web browser vendors had the privilege to extend the functionality by writing bindings to the runtimes. Thus there was no need to keep the C API of the bindings backwards compatible; it was permissible to request that the web browser developers update their source code as the JavaScript runtimes evolved, since they were working together anyway. Even V8, which was a latecomer to the game and is also led by a very, very experienced developer, has changed its API as it became better.
OTOH Ruby is used (mainly) on the server-side. Many popular ruby extensions are written as C bindings (consider an RDBMS driver). In other words, Ruby would have never succeeded without maintaining the compatibility.
Today, the difference still exists to some extent. Developers using node.js complain that it is hard to keep their native extensions backwards compatible, as V8 changes the API over time (and that is one of the reasons node.js has been forked). IIRC Ruby is still taking a much more conservative approach in this respect.
V8 is fast due to the JIT, Crankshaft, the type inferencer and data-optimized code. Tagged pointers, NaN-tagging of doubles.
And of course it does normal compiler optimizations in the middle.
The plain Ruby, Python and Perl engines do none of those, just minor basic optimizations.
The only major vm which comes close is luajit, which doesn’t even do type inference, constant folding, NaN-tagging nor integers, but uses similar small code and data structures, not as fat as the bad languages.
And my prototype dynamic languages, potion and p2 have similar features as luajit, and outperform v8. With an optional type system, “gradual typing”, you could easily outperform v8, as you can bypass crankshaft. See dart.
The known optimized backends, like pypy or jruby still suffer from various over-engineering techniques.
I’ve been testing out Selenium with Chromedriver and I noticed that some pages can detect that you’re using Selenium even though there’s no automation at all. Even when I’m just browsing manually just using chrome through Selenium and Xephyr I often get a page saying that suspicious activity was detected. I’ve checked my user agent, and my browser fingerprint, and they are all exactly identical to the normal chrome browser.
When I browse to these sites in normal chrome everything works fine, but the moment I use Selenium I’m detected.
In theory chromedriver and chrome should look literally exactly the same to any webserver, but somehow they can detect it.
If you browse around stubhub you’ll get redirected and ‘blocked’ within one or two requests. I’ve been investigating this and I can’t figure out how they can tell that a user is using Selenium.
How do they do it?
EDIT UPDATE:
I installed the Selenium IDE plugin in Firefox and I got banned when I went to stubhub.com in the normal firefox browser with only the additional plugin.
EDIT:
When I use Fiddler to view the HTTP requests being sent back and forth I’ve noticed that the ‘fake browser’s’ requests often have ‘no-cache’ in the response header.
You can use vim, or as @Vic Seedoubleyew has pointed out in the answer by @Erti-Chris Eelmaa, perl, to replace the cdc_ variable in chromedriver (see the post by @Erti-Chris Eelmaa to learn more about that variable). Using vim or perl saves you from having to recompile source code or use a hex editor. Make sure to make a copy of the original chromedriver before attempting to edit it. Also, the methods below were tested on chromedriver version 2.41.578706.
Using Vim
vim /path/to/chromedriver
After running the line above, you’ll probably see a bunch of gibberish. Do the following:
Search for cdc_ by typing /cdc_ and pressing return.
Enable editing by pressing a.
Delete any amount of $cdc_lasutopfhvcZLmcfl and replace what was deleted with an equal number of characters. If you don’t, chromedriver will fail.
After you’re done editing, press esc.
To save the changes and quit, type :wq! and press return.
If you don’t want to save the changes, but you want to quit, type :q! and press return.
You’re done.
Go to the altered chromedriver and double click on it. A terminal window should open up. If you don’t see killed in the output, you successfully altered the driver.
Using Perl
The line below replaces cdc_ with dog_:
perl -pi -e 's/cdc_/dog_/g' /path/to/chromedriver
Make sure that the replacement string has the same number of characters as the search string, otherwise the chromedriver will fail.
Perl Explanation
s///g denotes that you want to search for a string and replace it globally with another string (replaces all occurrences).
e.g., s/string/replacement/g
So,
s/// denotes searching for and replacing a string.
cdc_ is the search string.
dog_ is the replacement string.
g is the global key, which replaces every occurrence of the string.
How to check if the Perl replacement worked
To check the result, search the chromedriver binary for the search string cdc_; if nothing is printed, every occurrence has been replaced.
Conversely, search for your replacement string, dog_, to see if it is now in the chromedriver binary. If it is, the replacement string will be printed to the console.
Go to the altered chromedriver and double click on it. A terminal window should open up. If you don’t see killed in the output, you successfully altered the driver.
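The same-length replacement and the follow-up check can also be sketched in Python. This is a rough equivalent under stated assumptions, not the method tested above: `count_marker` and `replace_marker` are hypothetical helper names, and the demo runs against a small throwaway file rather than a real chromedriver binary.

```python
import os
import tempfile
from pathlib import Path


def count_marker(path, marker):
    """Return how many times the byte string `marker` occurs in the file."""
    return Path(path).read_bytes().count(marker)


def replace_marker(path, old, new):
    """Replace `old` with `new` in place, refusing length-changing edits."""
    if len(old) != len(new):
        # A different length would shift every later byte in the binary.
        raise ValueError("replacement must be the same length as the original")
    data = Path(path).read_bytes()
    Path(path).write_bytes(data.replace(old, new))


# Demo on a fake file; the contents are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "fake_driver.bin")
    Path(demo).write_bytes(b"\x00$cdc_abc\x00key=$cdc_abc\x00")
    before = count_marker(demo, b"cdc_")
    replace_marker(demo, b"cdc_", b"dog_")
    after_old = count_marker(demo, b"cdc_")
    after_new = count_marker(demo, b"dog_")
```

Against a real driver you would call `replace_marker("/path/to/chromedriver", b"cdc_", b"dog_")` on a copy, then confirm that `count_marker` reports zero for `b"cdc_"`.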
Wrapping Up
After altering the chromedriver binary, make sure that the name of the altered chromedriver binary is chromedriver, and that the original binary is either moved from its original location or renamed.
My Experience With This Method
I was previously being detected on a website while trying to log in, but after replacing cdc_ with an equal sized string, I was able to log in. Like others have said though, if you’ve already been detected, you might get blocked for a plethora of other reasons even after using this method. So you may have to try accessing the site that was detecting you using a VPN, different network, or what have you.
Basically, the way the Selenium detection works is that they test for predefined JavaScript variables which appear when running with Selenium. The bot detection scripts usually look for anything containing the word “selenium” / “webdriver” in any of the variables (on the window object), and also for document variables called $cdc_ and $wdc_. Of course, all of this depends on which browser you are on. All the different browsers expose different things.
For me, I used Chrome, so all that I had to do was to ensure that $cdc_ didn’t exist anymore as a document variable, and voilà (download the chromedriver source code, modify it, and recompile with $cdc_ under a different name).
This is the function I modified in chromedriver:
call_function.js:
function getPageCache(opt_doc) {
    var doc = opt_doc || document;
    //var key = '$cdc_asdjflasutopfhvcZLmcfl_';
    var key = 'randomblabla_';
    if (!(key in doc))
        doc[key] = new Cache();
    return doc[key];
}
(Note the comment: all I did was turn $cdc_ into randomblabla_.)
Here is some pseudo-code which demonstrates some of the techniques that bot detection scripts might use:
runBotDetection = function () {
    // Properties that drivers are known to set on the document object.
    var documentDetectionKeys = [
        "__webdriver_evaluate",
        "__selenium_evaluate",
        "__webdriver_script_function",
        "__webdriver_script_func",
        "__webdriver_script_fn",
        "__fxdriver_evaluate",
        "__driver_unwrapped",
        "__webdriver_unwrapped",
        "__driver_evaluate",
        "__selenium_unwrapped",
        "__fxdriver_unwrapped",
    ];
    // Properties that automation tools are known to set on the window object.
    var windowDetectionKeys = [
        "_phantom",
        "__nightmare",
        "_selenium",
        "callPhantom",
        "callSelenium",
        "_Selenium_IDE_Recorder",
    ];
    for (const windowDetectionKey of windowDetectionKeys) {
        if (window[windowDetectionKey]) {
            return true;
        }
    }
    for (const documentDetectionKey of documentDetectionKeys) {
        if (window['document'][documentDetectionKey]) {
            return true;
        }
    }
    // Look for a $cdc_-style page cache key on the document.
    for (const documentKey in window['document']) {
        if (documentKey.match(/\$[a-z]dc_/) && window['document'][documentKey]['cache_']) {
            return true;
        }
    }
    if (window['external'] && window['external'].toString() && (window['external'].toString()['indexOf']('Sequentum') != -1)) return true;
    if (window['document']['documentElement']['getAttribute']('selenium')) return true;
    if (window['document']['documentElement']['getAttribute']('webdriver')) return true;
    if (window['document']['documentElement']['getAttribute']('driver')) return true;
    return false;
};
According to user @szx, it is also possible to simply open chromedriver.exe in a hex editor and do the replacement manually, without actually doing any compiling.
As we’ve already figured out in the question and the posted answers, there is an anti-web-scraping and bot detection service called “Distil Networks” in play here. And, according to the company CEO’s interview:
Even though they can create new bots, we figured out a way to identify
Selenium the a tool they’re using, so we’re blocking Selenium no
matter how many times they iterate on that bot. We’re doing that now
with Python and a lot of different technologies. Once we see a pattern
emerge from one type of bot, then we work to reverse engineer the
technology they use and identify it as malicious.
It’ll take time and additional challenges to understand how exactly they are detecting Selenium, but what can we say for sure at the moment:
it’s not related to the actions you take with selenium – once you navigate to the site, you get immediately detected and banned. I’ve tried to add artificial random delays between actions, take a pause after the page is loaded – nothing helped
it’s not about browser fingerprint either – tried it in multiple browsers with clean profiles and not, incognito modes – nothing helped
since, according to the hint in the interview, this was “reverse engineering”, I suspect this is done with some JS code being executed in the browser revealing that this is a browser automated via selenium webdriver
Decided to post it as an answer, since clearly:
Can a website detect when you are using selenium with chromedriver?
Yes.
Also, what I haven’t experimented with is older Selenium and older browser versions. In theory, there could be something implemented/added to Selenium at a certain point that the Distil Networks bot detector currently relies on. If this is the case, we might detect (yeah, let’s detect the detector) at what point/version a relevant change was made, look into the changelog and changesets and, maybe, this could give us more information on where to look and what it is they use to detect a webdriver-powered browser. It’s just a theory that needs to be tested.
So I used reverse engineering and obfuscated the JS files by hex editing. Now I was sure that no JavaScript variables, function names or fixed strings could be used to uncover Selenium activity any more. But still some sites and reCaptcha detect Selenium!
Maybe they check the modifications that are caused by chromedriver js execution :)
Edit 1:
Chrome ‘navigator’ parameters modification
I discovered there are some parameters in ‘navigator’ that reveal the use of chromedriver.
These are the parameters:
“navigator.webdriver” On non-automated mode it is ‘undefined’. On automated mode it’s ‘true’.
“navigator.plugins” On headless chrome has 0 length. So I added some fake elements to fool the plugin length checking process.
“navigator.languages” was set to default chrome value ‘[“en-US”, “en”, “es”]’ .
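The first two checks in the list above can be mirrored in a few lines of Python for experimenting offline. This is only a sketch: `automation_signals` is a hypothetical helper, and the `navigator` argument is a plain dict standing in for values you would read out of the browser yourself.

```python
def automation_signals(navigator):
    """Return the reasons a navigator snapshot looks automated.

    `navigator` is a plain dict of values copied from the browser; the keys
    mirror the navigator properties discussed above.
    """
    reasons = []
    if navigator.get("webdriver") is True:
        reasons.append("navigator.webdriver is true")
    if len(navigator.get("plugins", [])) == 0:
        reasons.append("navigator.plugins is empty")
    return reasons


# Made-up snapshots for illustration.
normal = {"webdriver": None, "plugins": ["PDF Viewer"]}
headless = {"webdriver": True, "plugins": []}

normal_reasons = automation_signals(normal)
headless_reasons = automation_signals(headless)
```

A real detector presumably combines many more signals than these two, which would be consistent with patching them alone not being enough.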
So what I needed was a Chrome extension to run JavaScript on the web pages. I made an extension with the JS code provided in the article and used another article to add the zipped extension to my project. I successfully changed the values, but still nothing changed!
I didn’t find other variables like these, but that doesn’t mean that they don’t exist. reCaptcha still detects chromedriver, so there should be more variables to change. The next step would be reverse engineering the detector services, which I don’t want to do.
Now I’m not sure whether it is worth spending more time on this automation process or searching for alternative methods!
Try to use Selenium with a specific Chrome user profile. That way you can use it as a specific user and define anything you want. When doing so it will run as a ‘real’ user; look at the Chrome process with a process explorer and you’ll see the difference in the tags.
For example:
import os
from selenium import webdriver

username = os.getenv("USERNAME")
userProfile = "C:\\Users\\" + username + "\\AppData\\Local\\Google\\Chrome\\User Data\\Default"
options = webdriver.ChromeOptions()
options.add_argument("user-data-dir={}".format(userProfile))
# add here any tag you want.
options.add_experimental_option("excludeSwitches", ["ignore-certificate-errors", "safebrowsing-disable-download-protection", "safebrowsing-disable-auto-update", "disable-client-side-phishing-detection"])
# Raw string so the backslashes in the Windows path are not treated as escapes.
chromedriver = r"C:\Python27\chromedriver\chromedriver.exe"
os.environ["webdriver.chrome.driver"] = chromedriver
# executable_path and chrome_options are the keyword arguments of older Selenium releases.
browser = webdriver.Chrome(executable_path=chromedriver, chrome_options=options)
The webdriver IDL attribute of the Navigator interface must return the value of the webdriver-active flag, which is initially false.
This property allows websites to determine that the user agent is under control by WebDriver, and can be used to help mitigate denial-of-service attacks.
Taken directly from the 2017 W3C Editor’s Draft of WebDriver. This heavily implies that at the very least, future iterations of selenium’s drivers will be identifiable to prevent misuse. Ultimately, it’s hard to tell without the source code, what exactly causes chrome driver in specific to be detectable.
Firefox is said to set window.navigator.webdriver === true if working with a webdriver. That was according to one of the older specs (e.g.: archive.org) but I couldn’t find it in the new one except for some very vague wording in the appendices.
A test for it is in the Selenium code in the file fingerprint_test.js, where the comment at the end says “Currently only implemented in firefox”, but I wasn’t able to identify any code in that direction with some simple grepping, neither in the current (41.0.2) Firefox release tree nor in the Chromium tree.
I also found a comment for an older commit regarding fingerprinting in the firefox driver b82512999938 from January 2015. That code is still in the Selenium GIT-master downloaded yesterday at javascript/firefox-driver/extension/content/server.js with a comment linking to the slightly differently worded appendix in the current w3c webdriver spec.
In addition to the great answer of @Erti-Chris Eelmaa: there’s the annoying window.navigator.webdriver, and it is read-only. Even if you change its value to false, it will still be true. That’s why a browser driven by automated software can still be detected.
MDN
The variable is managed by the flag --enable-automation in Chrome. The chromedriver launches Chrome with that flag and Chrome sets window.navigator.webdriver to true. You can find it here. You need to add the flag to “exclude switches”. For instance (golang):
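Whatever the client language, the flag ends up in the `excludeSwitches` list of the `goog:chromeOptions` capability sent to chromedriver when a session is created. A minimal Python sketch of that payload, built by hand with no Selenium dependency; the `chrome_capabilities` helper is a hypothetical name, and sending the payload to a running driver is left out:

```python
import json


def chrome_capabilities(excluded_switches):
    """Build a W3C-style new-session payload for ChromeDriver."""
    return {
        "capabilities": {
            "alwaysMatch": {
                "browserName": "chrome",
                # Switches listed here are dropped from Chrome's command line.
                "goog:chromeOptions": {"excludeSwitches": list(excluded_switches)},
            }
        }
    }


payload = chrome_capabilities(["enable-automation"])
print(json.dumps(payload, indent=2))
```

Client libraries expose the same setting through their own option objects, so the JSON above is mainly useful for seeing what actually goes over the wire.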
It sounds like they are behind a web application firewall. Take a look at modsecurity and owasp to see how those work. In reality, what you are asking is how to do bot detection evasion. That is not what selenium web driver is for. It is for testing your web application not hitting other web applications. It is possible, but basically, you’d have to look at what a WAF looks for in their rule set and specifically avoid it with selenium if you can. Even then, it might still not work because you don’t know what WAF they are using. You did the right first step, that is faking the user agent. If that didn’t work though, then a WAF is in place and you probably need to get more tricky.
Edit:
Point taken from other answer. Make sure your user agent is actually being set correctly first. Maybe have it hit a local web server or sniff the traffic going out.
Even if you are sending all the right data (e.g. Selenium doesn’t show up as an extension, you have a reasonable resolution/bit-depth, &c), there are a number of services and tools which profile visitor behaviour to determine whether the actor is a user or an automated system.
For example, visiting a site then immediately going to perform some action by moving the mouse directly to the relevant button, in less than a second, is something no user would actually do.
It might also be useful as a debugging tool to use a site such as https://panopticlick.eff.org/ to check how unique your browser is; it’ll also help you verify whether there are any specific parameters that indicate you’re running in Selenium.
The bot detection I’ve seen seems more sophisticated or at least different than what I’ve read through in the answers below.
EXPERIMENT 1:
I open a browser and web page with Selenium from a Python console.
The mouse is already at a specific location where I know a link will appear once the page loads. I never move the mouse.
I press the left mouse button once (this is necessary to take focus from the console where Python is running to the browser).
I press the left mouse button again (remember, cursor is above a given link).
The link opens normally, as it should.
EXPERIMENT 2:
As before, I open a browser and the web page with Selenium from a Python console.
This time around, instead of clicking with the mouse, I use Selenium (in the Python console) to click the same element with a random offset.
The link doesn’t open, but I am taken to a sign up page.
IMPLICATIONS:
opening a web browser via Selenium doesn’t preclude me from appearing human
moving the mouse like a human is not necessary to be classified as human
clicking something via Selenium with an offset still raises the alarm
Seems mysterious, but I guess they can just determine whether an action originates from Selenium or not, while they don’t care whether the browser itself was opened via Selenium or not. Or can they determine if the window has focus? Would be interesting to hear if anyone has any insights.
One more thing I found is that some websites use a platform that checks the User Agent. If the value contains “HeadlessChrome”, the behavior can be weird when using headless mode.
The workaround for that will be to override the user agent value, for example in Java:
chromeOptions.addArguments("--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36");
Some sites are detecting this:
function d() {
    try {
        if (window.document.$cdc_asdjflasutopfhvcZLmcfl_.cache_)
            return !0
    } catch (e) {}
    try {
        //if (window.document.documentElement.getAttribute(decodeURIComponent("%77%65%62%64%72%69%76%65%72")))
        if (window.document.documentElement.getAttribute("webdriver"))
            return !0
    } catch (e) {}
    try {
        //if (decodeURIComponent("%5F%53%65%6C%65%6E%69%75%6D%5F%49%44%45%5F%52%65%63%6F%72%64%65%72") in window)
        if ("_Selenium_IDE_Recorder" in window)
            return !0
    } catch (e) {}
    try {
        //if (decodeURIComponent("%5F%5F%77%65%62%64%72%69%76%65%72%5F%73%63%72%69%70%74%5F%66%6E") in document)
        if ("__webdriver_script_fn" in document)
            return !0
    } catch (e) {}
I’ve found changing the javascript “key” variable like this:
//Fools the website into believing a human is navigating it
((JavascriptExecutor)driver).executeScript("window.key = \"blahblah\";");
works for some websites when using Selenium WebDriver along with Google Chrome, since many sites check for this variable in order to avoid being scraped by Selenium.
It seems to me the simplest way to do it with Selenium is to intercept the XHR that sends back the browser fingerprint.
But since this is a Selenium-only problem, it’s better just to use something else. Selenium is supposed to make things like this easier, not way harder.
You can try to use the parameter “enable-automation”
var options = new ChromeOptions();
// hide selenium
options.AddExcludedArguments(new List<string>() { "enable-automation" });
var driver = new ChromeDriver(ChromeDriverService.CreateDefaultService(), options);
But I want to warn that this ability was fixed in ChromeDriver 79.0.3945.16.
So you should probably use older versions of Chrome.
Also, as another option, you can try using InternetExplorerDriver instead of Chrome. In my experience, IE does not get blocked at all, without any hacks.
I want to send a datetime.datetime object in serialized form from Python using JSON and de-serialize in JavaScript using JSON. What is the best way to do this?
def handler(obj):
    # Anything exposing isoformat() (datetime, date, time) becomes an ISO 8601 string
    if hasattr(obj, 'isoformat'):
        return obj.isoformat()
    elif isinstance(obj, ...):
        return ...
    else:
        raise TypeError('Object of type %s with value of %s is not JSON serializable' % (type(obj), repr(obj)))
Update: Added output of type as well as value.
Update: Also handle date
For cross-language projects, I found out that strings containing RfC 3339 dates are the best way to go. An RfC 3339 date looks like this:
1985-04-12T23:20:50.52Z
I think most of the format is obvious. The only somewhat unusual thing may be the “Z” at the end. It stands for GMT/UTC. You could also add a timezone offset like +02:00 for CEST (Germany in summer). I personally prefer to keep everything in UTC until it is displayed.
For displaying, comparisons and storage you can leave it in string format across all languages. If you need the date for calculations, it is easy to convert it back to a native date object in most languages.
Unfortunately, the Date constructor in older JavaScript engines doesn’t accept RfC 3339 strings (engines implementing ES5 or later do parse this format), but there are many parsers available on the Internet.
huTools.hujson tries to handle the most common encoding issues you might come across in Python code including date/datetime objects while handling timezones correctly.
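As a rough illustration of the "keep everything in UTC" approach described above, the following Python 3 sketch formats a timezone-aware datetime as an RFC 3339 string with a trailing "Z". The helper name to_rfc3339 is hypothetical; it is not part of huTools or the standard library.

```python
from datetime import datetime, timedelta, timezone


def to_rfc3339(dt):
    """Format an aware datetime as an RFC 3339 UTC string ending in 'Z'.

    Hypothetical helper for illustration; naive datetimes are rejected
    because their offset from UTC is unknown.
    """
    if dt.tzinfo is None or dt.utcoffset() is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = dt.astimezone(timezone.utc)
    # isoformat() on a UTC-aware value ends in "+00:00"; swap that for "Z".
    return utc.isoformat().replace("+00:00", "Z")


# 01:20:50.52 at +02:00 (CEST) is 23:20:50.52 UTC on the previous day.
cest = timezone(timedelta(hours=2))
example = datetime(1985, 4, 13, 1, 20, 50, 520000, tzinfo=cest)
print(to_rfc3339(example))
```

Rejecting naive values is a design choice of this sketch: guessing a time zone silently tends to produce off-by-some-hours bugs that only show up on the receiving side.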
Let’s say you have a Python datetime object, d, created with datetime.now(). Its value is:
datetime.datetime(2011, 5, 25, 13, 34, 5, 787000)
You can serialize it to JSON as an ISO 8601 datetime string:
import json
json.dumps(d.isoformat())
The example datetime object would be serialized as:
'"2011-05-25T13:34:05.787000"'
This value, once received in the Javascript layer, can construct a Date object:
var d = new Date("2011-05-25T13:34:05.787000");
As of Javascript 1.8.5, Date objects have a toJSON method, which returns a string in a standard format. To serialize the above Javascript object back to JSON, therefore, the command would be:
d.toJSON()
Which would give you:
'2011-05-25T20:34:05.787Z'
This string, once received in Python, could be deserialized back to a datetime object:
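A minimal sketch of that deserialization step in Python 3, assuming the string always has the fixed shape that toJSON() produces (milliseconds and a literal trailing "Z"). The function name parse_js_json_date is made up for this example.

```python
from datetime import datetime, timezone


def parse_js_json_date(value):
    """Parse a string shaped like JavaScript's Date.prototype.toJSON() output.

    Assumes the fixed shape YYYY-MM-DDTHH:MM:SS.sssZ; other ISO 8601
    variants (UTC offsets, missing fractions) are not handled here.
    """
    naive = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    # The trailing "Z" means UTC, so attach that explicitly.
    return naive.replace(tzinfo=timezone.utc)


d = parse_js_json_date("2011-05-25T20:34:05.787Z")
print(repr(d))
```

For strings from other sources, which may carry offsets or omit the fraction, a general-purpose parser such as the third-party dateutil module is probably the safer choice.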
Here’s a fairly complete solution for recursively encoding and decoding datetime.datetime and datetime.date objects using the standard library json module. This needs Python >= 2.6, since the %f format code in the datetime.datetime.strptime() format string is only supported from that version on. For Python 2.5 support, drop the %f and strip the microseconds from the ISO date string before trying to convert it, but you’ll lose microsecond precision, of course. For interoperability with ISO date strings from other sources, which may include a time zone name or UTC offset, you may also need to strip some parts of the date string before the conversion. For a complete parser for ISO date strings (and many other date formats) see the third-party dateutil module.
Decoding only works when the ISO date strings are values in a JavaScript
literal object notation or in nested structures within an object. ISO date
strings that are items of a top-level array will not be decoded.
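The approach described above could be sketched roughly as follows for Python 3. This is not the answer's original code: the names encode_default, decode_hook and _convert are hypothetical, and the regular expressions only recognise the two shapes that isoformat() emits for naive values. Because the decoding runs through object_hook, which json only calls for JSON objects, it shows the same limitation: strings in a top-level array are left alone.

```python
import datetime
import json
import re

# Only the shapes this sketch emits: a date, or a naive datetime with
# optional microseconds and no time zone suffix.
_ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
_ISO_DATETIME = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{1,6})?$")


def encode_default(obj):
    """'default' hook for json.dumps: date/datetime become ISO strings."""
    if isinstance(obj, (datetime.datetime, datetime.date)):
        return obj.isoformat()
    raise TypeError("Object of type %s is not JSON serializable" % type(obj).__name__)


def _convert(value):
    """Convert ISO strings, recursing into lists; other values pass through."""
    if isinstance(value, list):
        return [_convert(item) for item in value]
    if not isinstance(value, str):
        return value
    try:
        if _ISO_DATETIME.match(value):
            fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in value else "%Y-%m-%dT%H:%M:%S"
            return datetime.datetime.strptime(value, fmt)
        if _ISO_DATE.match(value):
            return datetime.datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        pass  # looked like a date but was not a valid one; keep the string
    return value


def decode_hook(dct):
    """'object_hook' for json.loads: called for every decoded JSON object."""
    return {key: _convert(val) for key, val in dct.items()}


payload = {
    "created": datetime.datetime(2011, 5, 25, 13, 34, 5, 787000),
    "days": [datetime.date(2011, 5, 25)],
    "nested": {"seen": datetime.datetime(2011, 5, 25, 13, 0, 0)},
}
text = json.dumps(payload, default=encode_default)
restored = json.loads(text, object_hook=decode_hook)
print(restored == payload)
```

Note that any ordinary string that happens to look like an ISO date is converted as well; whether that is acceptable depends on the data being exchanged.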
Now, you can use json.dumps() as if it had always supported datetime…
json.dumps({'created':datetime.datetime.now()})
This makes sense if you require this extension to the json module to always kick in and wish to not change the way you or others use json serialization (either in existing code or not).
Note that some may consider patching libraries in that way as bad practice.
Special care needs to be taken in case you may wish to extend your application in more than one way. In such a case, I suggest using the solution by ramen or JT and choosing the proper json extension in each case.
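For illustration, patching the json module in this way might look roughly like the Python 3 sketch below. It is an assumption about how such a patch can be written, not the answer's original code; the names _original_default and _datetime_aware_default are hypothetical. It replaces JSONEncoder.default process-wide, which is exactly why some consider the technique bad practice.

```python
import datetime
import json

# Keep a reference to the stock implementation so other types still fail
# with the usual TypeError.
_original_default = json.JSONEncoder.default


def _datetime_aware_default(self, obj):
    """Replacement for JSONEncoder.default that understands date/datetime."""
    if isinstance(obj, (datetime.datetime, datetime.date)):
        return obj.isoformat()
    return _original_default(self, obj)


# Monkey patch: affects every json.dumps() call in the process from here on.
json.JSONEncoder.default = _datetime_aware_default

print(json.dumps({'created': datetime.datetime(2011, 5, 25, 13, 34, 5)}))
```

Callers that pass their own default= argument to json.dumps() bypass the patched method, so the two styles can coexist, although mixing them makes behavior harder to reason about.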
Answer 7
Not much to add to the community wiki answer, except for timestamps!
JavaScript uses the following format:
new Date().toJSON() // "2016-01-08T19:00:00.123Z"
Python side (for the json.dumps handler, see the other answers):
>>> from datetime import datetime
>>> d = datetime.strptime('2016-01-08T19:00:00.123Z', '%Y-%m-%dT%H:%M:%S.%fZ')
>>> d
datetime.datetime(2016, 1, 8, 19, 0, 0, 123000)
>>> d.isoformat() + 'Z'
'2016-01-08T19:00:00.123000Z'
import time, json
from datetime import datetime as dt
your_date = dt.now()
data = json.dumps(time.mktime(your_date.timetuple())*1000)
return data # data send to javascript
Apparently the “right” JSON (well, JavaScript) date format is 2012-04-23T18:25:43.511Z, i.e. UTC with a trailing “Z”. Without this, JavaScript will use the web browser’s local timezone when creating a Date() object from the string.
For a “naive” time (what Python calls a time with no timezone, and which this code assumes is local), the code below will force the local timezone so that it can then be correctly converted to UTC:
import json
from datetime import datetime, timezone

def default(obj):
    if hasattr(obj, "json") and callable(getattr(obj, "json")):
        return obj.json()
    if hasattr(obj, "isoformat") and callable(getattr(obj, "isoformat")):
        # date/time objects (the timezone handling assumes datetime.datetime)
        # compare with None: an aware UTC value has a zero offset, which is falsy
        if obj.utcoffset() is None:
            # add local timezone to "naive" local time
            # https://stackoverflow.com/questions/2720319/python-figure-out-local-timezone
            tzinfo = datetime.now(timezone.utc).astimezone().tzinfo
            obj = obj.replace(tzinfo=tzinfo)
        # convert to UTC
        obj = obj.astimezone(timezone.utc)
        # strip the UTC offset
        obj = obj.replace(tzinfo=None)
        return obj.isoformat() + "Z"
    elif hasattr(obj, "__str__") and callable(getattr(obj, "__str__")):
        return str(obj)
    else:
        print("obj:", obj)
        raise TypeError(obj)

def dump(j, io):
    json.dump(j, io, indent=2, default=default)
For the Python to JavaScript date conversion, the date needs to be in a specific format, i.e. either an ISO format string or a Unix timestamp. If the ISO string lacks some info, you can convert it to a Unix timestamp (in milliseconds) with Date.parse first. Moreover, Date.parse works with React as well, while new Date might trigger an exception.
In case you have a DateTime object without milliseconds, the following needs to be considered:
var unixDate = Date.parse('2016-01-08T19:00:00')
var desiredDate = new Date(unixDate).toLocaleDateString();
The example date could equally be a variable in the result.data object after an API call.
For options to display the date in the desired format (e.g. to display long weekdays) check out the MDN doc.
cd impls/haxe
# Neko
make all-neko
neko ./stepX_YYY.n
# Python
make all-python
python3 ./stepX_YYY.py
# C++
make all-cpp
./cpp/stepX_YYY
# JavaScript
make all-js
node ./stepX_YYY.js
Hy
The Hy implementation of mal has been tested with Hy 0.13.0.
cd impls/hy
./stepX_YYY.hy
Io
The Io implementation of mal has been tested with Io version 20110905.
cd impls/io
io ./stepX_YYY.io
Janet
The Janet implementation of mal has been tested with Janet version 1.12.2.
cd impls/janet
janet ./stepX_YYY.janet
Java 1.7
The Java implementation of mal requires maven2 to build.
cd impls/java
mvn compile
mvn -quiet exec:java -Dexec.mainClass=mal.stepX_YYY
# OR
mvn -quiet exec:java -Dexec.mainClass=mal.stepX_YYY -Dexec.args="CMDLINE_ARGS"
Java, using Truffle for GraalVM
This Java implementation will run on OpenJDK, but thanks to the Truffle framework it can run as much as 30x faster on GraalVM. It has been tested with OpenJDK 11, GraalVM CE 20.1.0, and GraalVM CE 21.1.0.
cd impls/java-truffle
./gradlew build
STEP=stepX_YYY ./run
JavaScript/Node
cd impls/js
npm install
node stepX_YYY.js
Julia
The Julia implementation of mal requires Julia 0.4.
cd impls/julia
julia stepX_YYY.jl
jq
Tested against version 1.6, with a lot of cheating in the IO department.
cd impls/jq
STEP=stepA_YYY ./run
# with Debug
DEBUG=true STEP=stepA_YYY ./run
Kotlin
The Kotlin implementation of mal has been tested with Kotlin 1.0.
cd impls/kotlin
make
java -jar stepX_YYY.jar
LiveScript
The LiveScript implementation of mal has been tested with LiveScript 1.5.
cd impls/livescript
make
node_modules/.bin/lsc stepX_YYY.ls
Logo
The Logo implementation of mal has been tested with UCBLogo 6.0.
cd impls/logo
logo stepX_YYY.lg
Lua
The Lua implementation of mal has been tested with Lua 5.3.5. The implementation requires luarocks to be installed.
cd impls/lua
make # to build and link linenoise.so and rex_pcre.so
./stepX_YYY.lua
cd impls/miniMAL
# Download miniMAL and dependencies
npm install
export PATH=`pwd`/node_modules/minimal-lisp/:$PATH
# Now run mal implementation in miniMAL
miniMAL ./stepX_YYY
make MAL_IMPL=IMPL "test^mal^step2"
# e.g.
make "test^mal^step2" # js is default
make MAL_IMPL=ruby "test^mal^step2"
make MAL_IMPL=python "test^mal^step2"
Starting the REPL
To start the REPL of an implementation in a specific step:
make "repl^IMPL^stepX"
# e.g
make "repl^ruby^step3"
make "repl^ps^step4"
If you omit the step, then stepA is used:
make "repl^IMPL"
# e.g
make "repl^ruby"
make "repl^ps"
-------------------------------------              ---------------------------------
|        Security Testing           |              |        Social-Analyzer        |
-------------------------------------              ---------------------------------
|   Passive Information Gathering   |     <-->     |   Find Social Media Profiles  |
|                                   |              |                               |
|    Active Information Gathering   |     <-->     |    Post Analysis Activities   |
-------------------------------------              ---------------------------------
sudo apt-get update
# Depending on your Linux distro, you may or may not need these 2 lines
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y software-properties-common
sudo add-apt-repository ppa:mozillateam/ppa -y
sudo apt-get install -y firefox-esr tesseract-ocr git nodejs npm
git clone https://github.com/qeeqbox/social-analyzer.git
cd social-analyzer
npm install
npm start
Required Arguments:
--username E.g. johndoe, john_doe or johndoe9999
Optional Arguments:
--websites Website or websites separated by space E.g. youtube, tiktok or tumblr
--mode Analysis mode E.g.fast -> FindUserProfilesFast, slow -> FindUserProfilesSlow or special -> FindUserProfilesSpecial
--output Show the output in the following format: json -> json output for integration or pretty -> prettify the output
--options Show the following when a profile is found: link, rate, title or text
--method find -> show detected profiles, get -> show all profiles regardless detected or not, both -> combine find & get
--filter Filter detected profiles by good, maybe or bad, you can do combine them with comma (good,bad) or use all
--profiles Filter profiles by detected, unknown or failed, you can do combine them with comma (detected,failed) or use all
--extract Extract profiles, urls & patterns if possible
--metadata Extract metadata if possible (pypi QeeqBox OSINT)
--trim Trim long strings
Listing websites & detections:
--list List all available websites
Setting:
--headers Headers as dict
--logs_dir Change logs directory
--timeout Change timeout between each request
--silent Disable output to screen
CLI Options:
-f, --file Input file(s) (Pass '-' for stdin)
-r, --replace Write output in-place, replacing input
-o, --outfile Write output to file (default stdout)
--config Path to config file
--type [js|css|html] ["js"] Select beautifier type (NOTE: Does *not* filter files, only defines which beautifier type to run)
-q, --quiet Suppress logging to stdout
-h, --help Show this help
-v, --version Show the version
Beautifier Options:
-s, --indent-size Indentation size [4]
-c, --indent-char Indentation character [" "]
-t, --indent-with-tabs Indent with tabs, overrides -s and -c
-e, --eol Character(s) to use as line terminators.
[first newline in file, otherwise "\n"]
-n, --end-with-newline End output with newline
--editorconfig Use EditorConfig to set up the options
-l, --indent-level Initial indentation level [0]
-p, --preserve-newlines Preserve line-breaks (--no-preserve-newlines disables)
-m, --max-preserve-newlines Number of line-breaks to be preserved in one chunk [10]
-P, --space-in-paren Add padding spaces within paren, ie. f( a, b )
-E, --space-in-empty-paren Add a single space inside empty paren, ie. f( )
-j, --jslint-happy Enable jslint-stricter mode
-a, --space-after-anon-function Add a space before an anonymous function's parens, ie. function ()
--space-after-named-function Add a space before a named function's parens, i.e. function example ()
-b, --brace-style [collapse|expand|end-expand|none][,preserve-inline] [collapse,preserve-inline]
-u, --unindent-chained-methods Don't indent chained method calls
-B, --break-chained-methods Break chained method calls across subsequent lines
-k, --keep-array-indentation Preserve array indentation
-x, --unescape-strings Decode printable characters encoded in \xNN notation
-w, --wrap-line-length Wrap lines that exceed N characters [0]
-X, --e4x Pass E4X xml literals through untouched
--good-stuff Warm the cockles of Crockford's heart
-C, --comma-first Put commas at the beginning of new line instead of end
-O, --operator-position Set operator position (before-newline|after-newline|preserve-newline) [before-newline]
--indent-empty-lines Keep indentation on empty lines
--templating List of templating languages (auto,django,erb,handlebars,php,smarty) ["auto"] auto = none in JavaScript, all in html
// Programmatic access
var beautify_js = require('js-beautify'); // also available under "js" export
var beautify_css = require('js-beautify').css;
var beautify_html = require('js-beautify').html;

// All methods accept two arguments, the string to be beautified, and an options object.
The CSS and HTML beautifiers are much simpler in scope, and possess far fewer options.
CSS Beautifier Options:
-s, --indent-size Indentation size [4]
-c, --indent-char Indentation character [" "]
-t, --indent-with-tabs Indent with tabs, overrides -s and -c
-e, --eol Character(s) to use as line terminators. (default newline - "\\n")
-n, --end-with-newline End output with newline
-b, --brace-style [collapse|expand] ["collapse"]
-L, --selector-separator-newline Add a newline between multiple selectors
-N, --newline-between-rules Add a newline between CSS rules
--indent-empty-lines Keep indentation on empty lines
HTML Beautifier Options:
-s, --indent-size Indentation size [4]
-c, --indent-char Indentation character [" "]
-t, --indent-with-tabs Indent with tabs, overrides -s and -c
-e, --eol Character(s) to use as line terminators. (default newline - "\\n")
-n, --end-with-newline End output with newline
-p, --preserve-newlines Preserve existing line-breaks (--no-preserve-newlines disables)
-m, --max-preserve-newlines Maximum number of line-breaks to be preserved in one chunk [10]
-I, --indent-inner-html Indent <head> and <body> sections. Default is false.
-b, --brace-style [collapse-preserve-inline|collapse|expand|end-expand|none] ["collapse"]
-S, --indent-scripts [keep|separate|normal] ["normal"]
-w, --wrap-line-length Maximum characters per line (0 disables) [250]
-A, --wrap-attributes Wrap attributes to new lines [auto|force|force-aligned|force-expand-multiline|aligned-multiple|preserve|preserve-aligned] ["auto"]
-i, --wrap-attributes-indent-size Indent wrapped attributes to after N characters [indent-size] (ignored if wrap-attributes is "aligned")
-d, --inline List of tags to be considered inline tags
-U, --unformatted List of tags (defaults to inline) that should not be reformatted
-T, --content_unformatted List of tags (defaults to pre) whose content should not be reformatted
-E, --extra_liners List of tags (defaults to [head,body,/html]) that should have an extra newline before them.
--editorconfig Use EditorConfig to set up the options
--indent_scripts Sets indent level inside script tags ("normal", "keep", "separate")
--unformatted_content_delimiter Keep text content together between this string [""]
--indent-empty-lines Keep indentation on empty lines
--templating List of templating languages (auto,none,django,erb,handlebars,php,smarty) ["auto"] auto = none in JavaScript, all in html
// Use ignore when the content is not parsable in the current language, JavaScript in this case.
var a = 1;
/* beautify ignore:start */
 {This is some strange{template language{using open-braces?
/* beautify ignore:end */
Preserve directive
NOTE: this directive only works in HTML and JavaScript, not CSS.
The preserve directive makes the beautifier parse and then keep the existing formatting of a section of code.
The input below will remain unchanged after beautification:
// Use preserve when the content is valid syntax in the current language, JavaScript in this case.
// This will parse the code and preserve the existing formatting.
/* beautify preserve:start */
{
    browserName: 'internet explorer',
    platform:    'Windows 7',
    version:     '8'
}
/* beautify preserve:end */